
Showing papers by "Wayne Luk" published in 2011


Journal ArticleDOI
TL;DR: A deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching and a new routing strategy called k-step look ahead is introduced.
Abstract: Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffic. However, implementation of adaptive routing in a network-on-chip system is not trivial and is further complicated by the requirements of deadlock freedom and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. Also, a new routing strategy called k-step look ahead is introduced. This new strategy can substantially reduce the size of the routing table and maintain a high quality of adaptation, which leads to a scalable dynamic-routing solution with minimal hardware overhead. Our results, based on a cycle-accurate simulator, demonstrate the effectiveness of the DP network, which outperforms both deterministic and adaptive-routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for the DP network is insignificant, based on the results obtained from the hardware implementations.
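
A rough software analogue of the DP network (not the paper's hardware) is an iterative relaxation that gives every router a cost-to-destination, so that the next hop is always on a currently optimal path; the k-step look-ahead strategy would then consult only a bounded neighbourhood of these values. The mesh size and link costs below are purely illustrative.

```python
# Illustrative DP relaxation over a width x height mesh: each node's
# cost-to-destination is refined from its neighbours until convergence.
def dp_costs(width, height, dest, link_cost):
    INF = float("inf")
    cost = {(x, y): INF for x in range(width) for y in range(height)}
    cost[dest] = 0.0
    changed = True
    while changed:                      # iterate until the DP values converge
        changed = False
        for x in range(width):
            for y in range(height):
                for nb in ((x+1, y), (x-1, y), (x, y+1), (x, y-1)):
                    if nb in cost:
                        c = link_cost((x, y), nb) + cost[nb]
                        if c < cost[(x, y)]:
                            cost[(x, y)] = c
                            changed = True
    return cost

# With uniform link costs the DP value is just the Manhattan distance.
costs = dp_costs(4, 4, dest=(3, 3), link_cost=lambda a, b: 1.0)
print(costs[(0, 0)])   # 6.0
```

A router would forward a packet towards whichever neighbour currently reports the lowest cost, adapting automatically as congestion changes the link costs.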

82 citations


Journal ArticleDOI
TL;DR: This work presents an embedded communication macro, a communication infrastructure that interconnects PR modules in a tiled PR region, which enables a flexible online-placement of PR modules.
Abstract: In partially reconfigurable architectures, system components can be dynamically loaded and unloaded allowing resources to be shared over time. Dynamic system components are represented by partial reconfiguration (PR) modules. In comparison to a static system, the design of a partially reconfigurable system requires additional design steps, such as partitioning the device resources into static and dynamic regions. We present the concept of tiled PR regions, which enables a flexible online-placement of PR modules. Dynamic reconfiguration requires a suitable communication infrastructure to interconnect the static and dynamic system components. We present an embedded communication macro, a communication infrastructure that interconnects PR modules in a tiled PR region. Efficient online-placement of PR modules depends not only on the placement algorithm, but also on design-time aspects such as the chosen synthesis regions of the PR modules. We propose a design method for selecting suitable synthesis regions for the PR modules aiming to optimize their placement at run-time.

59 citations


Proceedings ArticleDOI
01 Dec 2011
TL;DR: This paper presents a framework for reconfigurable hardware acceleration of large-scale graph problems that are difficult to partition and require high-latency off-chip memory storage; the architecture tolerates off-chip memory latency.
Abstract: In many application domains, data are represented using large graphs involving millions of vertices and edges. Graph analysis algorithms, such as finding short paths and isomorphic subgraphs, are largely dominated by memory latency. Large cluster-based computing platforms can process graphs efficiently if the graph data can be partitioned, and on a smaller scale partitioning can be used to allocate graphs to low-latency on-chip RAMs in reconfigurable devices. However, there are many graph classes, such as scale-free social networks, which lack the locality to make partitioning graph data an efficient solution to the latency problem and are far too large to fit in on-chip RAMs and caches. In this paper, we present a framework for reconfigurable hardware acceleration of these large-scale graph problems that are difficult to partition and require high-latency off-chip memory storage. Our reconfigurable architecture tolerates off-chip memory latency by using a memory crossbar that connects many parallel identical processing elements to shared off-chip memory, without a traditional cached memory hierarchy. Quantitative comparison between the software and hardware performance of a graphlet counting case-study shows that our hardware implementation outperforms a quad-core software implementation by 10 times for large graphs. This speedup includes all software and IO overhead required, and reduces execution time for this common bioinformatics algorithm from about 2 hours to just 12 minutes. These results demonstrate that our methodology for accelerating graph algorithms is a promising approach for efficient parallel graph processing.
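
The graphlet-counting case study can be illustrated in plain software; the paper's hardware streams this kind of neighbourhood intersection through many identical processing elements sharing off-chip memory. Triangle counting, shown below, is the simplest instance of the idea; this sketch is not the paper's algorithm.

```python
# Count triangles (the 3-node graphlet) by intersecting the adjacency
# sets of each edge's endpoints -- a memory-bound access pattern on
# large, poorly-partitionable graphs.
def count_triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, v in edges:
        # each common neighbour of an edge's endpoints closes a triangle
        total += len(adj[u] & adj[v])
    return total // 3        # every triangle is counted once per edge

print(count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)]))  # 1
```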

50 citations


Book ChapterDOI
23 Mar 2011
TL;DR: This paper analyses two methods of organizing parallelism for the Smith-Waterman algorithm, and shows how they perform relative to peak performance when the amount of parallelism varies.
Abstract: This paper analyses two methods of organizing parallelism for the Smith-Waterman algorithm, and shows how they perform relative to peak performance when the amount of parallelism varies. A novel systolic design is introduced, with a processing element optimized for computing the affine gap cost function. Our FPGA design is significantly more energy-efficient than GPU designs. For example, our design for the XC5VLX330T FPGA achieves around 16 GCUPS/W, while CPUs and GPUs have a power efficiency of lower than 0.5 GCUPS/W.

33 citations


Journal ArticleDOI
TL;DR: An FPGA-accelerated Asian option pricing solution, using a highly-optimised parallel Monte-Carlo architecture is proposed, and the proposed pipelined design is described parametrically, facilitating its re-use for different technologies.
Abstract: Arithmetic Asian options are financial derivatives which have the feature of path-dependency: they depend on the entire price path of the underlying asset, rather than just the instantaneous price. This path-dependency makes them difficult to price, as only computationally intensive Monte-Carlo methods can provide accurate prices. This paper proposes an FPGA-accelerated Asian option pricing solution, using a highly-optimised parallel Monte-Carlo architecture. The proposed pipelined design is described parametrically, facilitating its re-use for different technologies. An implementation of this architecture in a Virtex-5 xc5vlx330t FPGA at 200MHz is 313 times faster than a multi-threaded software implementation running on an Intel Xeon E5420 quad-core CPU at 2.5GHz; it is also 2.2 times faster than the Tesla C1060 GPU at 1.3 GHz.
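
A software model of the Monte-Carlo kernel that such a pipeline replicates: simulate geometric-Brownian-motion price paths and average the payoff of an arithmetic Asian call. The parameters and function name are illustrative, not taken from the paper.

```python
# Price an arithmetic Asian call by Monte-Carlo simulation of GBM paths.
import math, random

def asian_call_mc(s0, k, r, sigma, t, steps, paths, seed=1):
    random.seed(seed)
    dt = t / steps
    drift = (r - 0.5 * sigma**2) * dt
    vol = sigma * math.sqrt(dt)
    total = 0.0
    for _ in range(paths):
        s, path_sum = s0, 0.0
        for _ in range(steps):
            s *= math.exp(drift + vol * random.gauss(0.0, 1.0))
            path_sum += s                 # path-dependency: running sum
        avg = path_sum / steps            # arithmetic average price
        total += max(avg - k, 0.0)        # call payoff on the average
    return math.exp(-r * t) * total / paths

print(round(asian_call_mc(100, 100, 0.05, 0.2, 1.0, 64, 2000), 2))
```

Each path is independent, which is exactly why the computation maps well onto many parallel pipelined cores.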

31 citations


Journal ArticleDOI
TL;DR: The model relates the lookup-table size, the cluster size, and the number of inputs per cluster to the amount of logic that can be packed into each lookup-table and cluster, the number of used inputs per cluster, and the depth of the circuit after technology mapping and clustering.
Abstract: This paper presents an analytical model that relates FPGA architectural parameters to the logic size and depth of an FPGA implementation. In particular, the model relates the lookup-table size, the cluster size, and the number of inputs per cluster to the amount of logic that can be packed into each lookup-table and cluster, the number of used inputs per cluster, and the depth of the circuit after technology mapping and clustering. Comparison to experimental results shows that our model has good accuracy. We illustrate how the model can be used in FPGA architectural investigations to complement the experimental approach. The model's accuracy, combined with the simple form of its equations, makes it a powerful tool for FPGA architects to better understand and guide the development of future FPGA architectures.

28 citations


Proceedings ArticleDOI
05 Sep 2011
TL;DR: This work focuses on an FPGA accelerated cluster system coupled with a wireless network that achieves enhanced power efficiency while fulfilling thermal constraints in all nodes by applying the proposed inter-FPGA wireless network to the N-Body application.
Abstract: FPGA accelerators are capable of improving the computation and energy efficiency of many applications targeting a cluster of machines. In this work, we focus on an FPGA-accelerated cluster system coupled with a wireless network. Compared with conventional Ethernet-based approaches, the proposed system with a wireless network enables a light-weight and efficient method for the FPGA devices to exchange information directly. Customisable monitoring facilities are developed to support changing a distributed application dynamically at run time. The N-Body simulation application is used to demonstrate the effectiveness and potential of the proposed system. Experiments show that this approach can achieve up to 4.2 times improvement in latency. By applying the proposed inter-FPGA wireless network to the N-Body application, we achieve enhanced power efficiency while fulfilling thermal constraints in all nodes.

27 citations


Journal ArticleDOI
TL;DR: The approach shows that, for N-body computation, the fastest design which involves 2 CPU cores, 10 FPGA cores and 40960 GPU threads, is 2 times faster than a design with only FPGAs while achieving better overall energy efficiency.
Abstract: Processing speed and energy efficiency are two of the most critical issues for computer systems. This paper presents a systematic approach for profiling the power and performance characteristics of applications targeting heterogeneous multi-core computing platforms. Our approach enables rapid and automated design space exploration involving optimisation of workload distribution for systems with accelerators such as FPGAs and GPUs. We demonstrate that, with minor modification to the design, it is possible to estimate the performance and power-efficiency trade-off to identify an optimised workload distribution. Our approach shows that, for N-body computation, the fastest design, which involves 2 CPU cores, 10 FPGA cores and 40960 GPU threads, is 2 times faster than a design with only FPGAs while achieving better overall energy efficiency.

26 citations


Book
26 Aug 2011
TL;DR: A new treatment of computer system design, particularly for System-on-Chip (SOC), which addresses the issues mentioned above and presents future challenges for system design and SOC possibilities.
Abstract: The next generation of computer system designers will be less concerned about details of processors and memories, and more concerned about the elements of a system tailored to particular applications. These designers will have a fundamental knowledge of processors and other elements in the system, but the success of their design will depend on the skills in making system-level tradeoffs that optimize the cost, performance and other attributes to meet application requirements. This book provides a new treatment of computer system design, particularly for System-on-Chip (SOC), which addresses the issues mentioned above. It begins with a global introduction, from the high-level view to the lowest common denominator (the chip itself), then moves on to the three main building blocks of an SOC (processor, memory, and interconnect). Next is an overview of what makes SOC unique (its customization ability and the applications that drive it). The final chapter presents future challenges for system design and SOC possibilities.

26 citations


Proceedings ArticleDOI
18 Jul 2011
TL;DR: The results show that the proposed approach can develop and manage a computing system for each application to adjust its power consumption with respect to the power supply while maximizing speed.
Abstract: Energy harvesting systems provide a promising alternative to battery-powered systems and create an opportunity for architecture and design method innovation for the exploitation of ambient energy sources. In this paper, we propose a two-stage optimization approach to develop power-adaptive computing systems which can efficiently use energy harvested from a solar source. At design time, an SPMD (single program, multiple data) computation structure with multiple parallel processing units is generated, and a convex optimizer runs at run-time to decide how many processing units can operate simultaneously subject to the instantaneous power supplied from the harvester. The approach is evaluated on three embedded applications. The results show that the proposed approach can develop and manage a computing system for each application to adjust its power consumption with respect to the power supply while maximizing speed. Compared to static systems without adaptability, our power-adaptive computing system improves harvested energy utilization efficiency by up to 28.8%. These computation systems can be applied to distributed monitoring networks to improve computation capability at nodes. In our experiments, the throughput per watt of a node with an ARM9 processor can be improved 19 times by adding the developed adaptive computing system to the node.
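
The run-time decision can be caricatured in a few lines: given the instantaneous harvested power, activate as many parallel processing units as the budget allows. The paper solves this with a convex optimiser; the greedy threshold rule and all power figures below are invented for illustration.

```python
# Decide how many processing units to activate under a power budget.
def active_units(p_harvested, p_static, p_per_unit, max_units):
    if p_harvested < p_static:
        return 0                          # not enough power to run at all
    budget = p_harvested - p_static       # power left for processing units
    return min(max_units, int(budget // p_per_unit))

# Example: 120 mW harvested, 20 mW static, 30 mW per unit, up to 8 units.
print(active_units(120, 20, 30, 8))       # 3
```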

19 citations


Book ChapterDOI
01 Jan 2011
TL;DR: A programming language, LARA, will allow the exploration of alternative architectures and design patterns enabling the generation of flexible hardware cores that can be easily incorporated into larger multi-core designs, and the effectiveness of the proposed approach will be evaluated using partner-provided codes from the domain of audio processing and real-time avionics.
Abstract: The relentless increase in capacity of Field-Programmable Gate-Arrays (FPGAs) has made them vehicles of choice for both prototypes and final products requiring on-chip multi-core, heterogeneous and reconfigurable systems. Multiple cores can be embedded as hard- or soft-macros, have customizable instruction sets, multiple distributed RAMs and/or configurable interconnections. Their flexibility allows them to achieve orders of magnitude better performance than conventional computing systems via customization. Programming these systems, however, is extremely cumbersome and error-prone, and as a result their true potential is very often achieved only at unreasonably high design effort. This project covers developing, implementing and evaluating a novel compilation and synthesis system approach for FPGA-based platforms. We rely on Aspect-Oriented Specifications to convey critical domain knowledge to a mapping engine while preserving the advantages of a high-level imperative programming paradigm in early software development as well as program and application portability. We leverage Aspect-Oriented specifications and a set of transformations to generate an intermediate representation suitable to hardware mapping. A programming language, LARA, will allow the exploration of alternative architectures and design patterns enabling the generation of flexible hardware cores that can be easily incorporated into larger multi-core designs. We will evaluate the effectiveness of the proposed approach using partner-provided codes from the domain of audio processing and real-time avionics. We expect the technology developed in REFLECT to be integrated by our industrial partners, in particular by ACE, a leading compilation tool supplier for embedded systems, and by Honeywell, a worldwide solution supplier of embedded high-performance systems.

Proceedings ArticleDOI
01 Dec 2011
TL;DR: Constant Power Reconfigurable Computing is presented, a general and device-independent framework based on a closed-loop control system used to keep the power consumption constant for any reconfigurable computing design targeting FPGA implementation.
Abstract: We present Constant Power Reconfigurable Computing, a general and device-independent framework based on a closed-loop control system used to keep the power consumption constant for any reconfigurable computing design targeting FPGA implementation. We develop an on-chip power consumer, an on-chip power monitor and a proportional-integral-derivative controller with circuit primitives available in most commercial FPGAs. We demonstrate the effectiveness of the proposed methodology on a square-and-multiply exponentiation circuit implemented on a Spartan-6 LX45 FPGA board. By reducing the peak autocorrelation values by a factor of 2.7 on average, the proposed Constant Power Reconfigurable Computing approach decreases the information leaked by the power consumption of this system with only 26% area overhead and 28% power overhead.

Proceedings ArticleDOI
01 May 2011
TL;DR: A novel methodology for mixed-precision comparison is introduced, which improves comparison performance by using reduced-precision data paths while maintaining accuracy by using high-precision data paths.
Abstract: Customisable data formats provide an opportunity for exploring trade-offs in accuracy and performance of reconfigurable systems. This paper introduces a novel methodology for mixed-precision comparison, which improves comparison performance by using reduced-precision data paths while maintaining accuracy by using high-precision data paths. Our methodology adopts reduced-precision data-paths for preliminary comparison, and high-precision data-paths when the accuracy for preliminary comparison is insufficient. We develop an analytical model for performance estimation of the proposed mixed-precision methodology. Optimisation based on integer linear programming is employed for determining the optimal precision and resource allocation for each of the data paths. The effectiveness of our approach is evaluated using a common collision detection problem. Performance gains of 4 to 7.3 times are obtained over baseline fixed-precision designs for the same FPGAs. With the help of the proposed mixed-precision methodology, our FPGA designs are 15.4 to 16.7 times faster than software running on multi-core CPUs with the same technology.
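
The core idea can be sketched in software: compare quantised values first, and re-run only the inconclusive cases at full precision. The quantisation scheme, error bound and function names below are illustrative assumptions, not the paper's data formats.

```python
# Mixed-precision comparison: a cheap reduced-precision check decides
# most cases; only inputs within its error bound fall back to the
# high-precision data path.
def quantise(x, bits):
    scale = 2 ** bits
    return round(x * scale) / scale

def mixed_precision_less(a, b, bits=8):
    qa, qb = quantise(a, bits), quantise(b, bits)
    err = 2.0 ** -bits                  # worst-case quantisation error
    if abs(qa - qb) > 2 * err:          # reduced precision is conclusive
        return qa < qb, "reduced"
    return a < b, "full"                # fall back to high precision

print(mixed_precision_less(0.125, 0.75))   # (True, 'reduced')
```

In hardware, the payoff comes from the reduced-precision path being much smaller and faster, so many comparators fit in the same area.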

Proceedings ArticleDOI
01 May 2011
TL;DR: A framework for comparing the performance of numerical option pricing methods using FPGAs is proposed, taking into account both speed (time to solution) and accuracy (quality of solution), and examines how the speed-accuracy trade-off curve varies for each method.
Abstract: A number of different numerical methods for accelerating financial option pricing using FPGAs have recently been investigated, such as Monte-Carlo, finite-difference, quadrature, and binomial trees. However, these papers only compare acceleration of each method against the same method in software, and do not consider a more important practical question, which is to identify the method that provides the best FPGA performance for a given option pricing application, regardless of raw speed-up over software. This paper proposes a framework for comparing the performance of numerical option pricing methods using FPGAs, taking into account both speed (time to solution) and accuracy (quality of solution), and examines how the speed-accuracy trade-off curve varies for each method. We apply the framework to European and American option pricing problems using Virtex-4 parts, and show that the quadrature solver converges fastest for both European and American options, and is also the most accurate in terms of root mean squared error for European options. However, when very accurate American results are needed, the finite-difference solver is the most efficient method. Our results also show that the Monte-Carlo solver is at least 100 times less accurate in log scale than those based on other pricing methodologies; this drawback outweighs its benefit of the large raw speed-ups found in previous papers.

Journal ArticleDOI
TL;DR: A programming framework for high performance clusters with various hardware accelerators that has been used to support physics simulation and financial application development and achieves significant performance improvement on a 16-node cluster with FPGA and GPU accelerators.
Abstract: We describe a programming framework for high performance clusters with various hardware accelerators. In this framework, users can utilize the available heterogeneous resources productively and efficiently. The distributed application is highly modularized to support dynamic system configuration with changing types and number of the accelerators. Multiple layers of communication interface are introduced to reduce the overhead in both control messages and data transfers. Parallelism can be achieved by controlling the accelerators in various schemes through scheduling extension. The framework has been used to support physics simulation and financial application development. We achieve significant performance improvement on a 16-node cluster with FPGA and GPU accelerators.

Proceedings ArticleDOI
30 Nov 2011
TL;DR: An analytical optimisation approach that explores the benefit of specialised designs over a static one and the key to this approach is the performance and area estimation of kernels that is based on the parameters of arithmetic operators inside the kernel.
Abstract: This paper explores the reconfiguration of slowly changing constants in an explicit finite difference solver for option pricing. Numerical methods for option pricing, such as finite difference, are computationally very complex and can be aided by hardware acceleration. Such hardware implementations can be further improved by specialising the circuit for constants, and reconfiguring the circuit when the constants change. In this paper we demonstrate how this concept can be applied to the pricing of European and American options. We present an analytical optimisation approach that explores the benefit of specialised designs over a static one. The key to this approach is the performance and area estimation of kernels that is based on the parameters of arithmetic operators inside the kernel. This allows us to quickly explore several design options without building full designs. Our experimental results on a Xilinx XC6VLX760 FPGA show that with a partially reconfigurable design performance can be improved by a factor of 4.7 over a design without reconfiguration.

01 Jan 2011
TL;DR: This paper offers a means of achieving high performance by producing parallel architectures adapted both to the problem domain and to specific problem instances by exploiting user-customisable parallelism available in advanced reconfigurable devices such as Field-Programmable Gate Arrays.
Abstract: Parallel approaches to Inductive Logic Programming (ILP) are adopted to address the computational complexity in the learning process. Existing parallel ILP implementations build on conventional general-purpose processors. This paper describes a different approach, by exploiting user-customisable parallelism available in advanced reconfigurable devices such as Field-Programmable Gate Arrays (FPGAs). Our customisable parallel architecture for ILP has three elements: a customisable logic programming processor, a multi-processor for parallel hypothesis evaluation, and an architecture generation framework for creating such multi-processors. Our approach offers a means of achieving high performance by producing parallel architectures adapted both to the problem domain and to specific problem instances.

Proceedings ArticleDOI
01 Dec 2011
TL;DR: A novel approach for verifying the implementation of an application program for a customized soft-processor, based on the ACL2 theorem prover, is proposed, showing how processors with different custom instructions and with different number of pipelined stages can be automatically produced and verified.
Abstract: Soft-processors, instruction processors implemented in FPGA technology, are often customizable to support domain-specific optimization. However, the correctness of customized soft-processors, executing the associated machine code, is often not obvious. This paper proposes a novel approach for verifying the implementation of an application program for a customized soft-processor, based on the ACL2 theorem prover. The correctness proof involves verifying a machine code program executing on the target hardware device against a high-level specification of the application program. We illustrate the proposed approach with several case studies, showing how processors with different custom instructions and with different numbers of pipeline stages can be automatically produced and verified; such processors have a range of trade-offs in performance, size, power and energy consumption to meet different requirements.

Proceedings ArticleDOI
18 Jul 2011
TL;DR: This work is the first hardware architecture for the Lucas test based on the binary Jacobi algorithm and the fastest 45 nm ASIC implementation is 3.6 times faster and 400 times more energy efficient than the optimised software implementation in comparable technology.
Abstract: We present our parametric hardware architecture of the NIST approved Lucas probabilistic primality test. To our knowledge, our work is the first hardware architecture for the Lucas test. Our main contributions are a hardware architecture for calculating the Jacobi symbol based on the binary Jacobi algorithm, a pipelined modular add-shift module for calculating the Lucas sequences, and methods for dependence analysis and for scheduling of the Lucas sequences computation. Our architecture implemented on a Virtex-5 FPGA is 30% slower but 3 times more energy efficient than the software version running on an Intel Xeon W3505. Our fastest 45 nm ASIC implementation is 3.6 times faster and 400 times more energy efficient than the optimised software implementation in comparable technology. The performance scaling of our architecture is much better than linear in area. Different speed/area/energy trade-offs are available through parametrization. The cell count and the power consumption of our ASIC implementations make them suitable for integration into an embedded system, whereas our FPGA implementation would more likely benefit server applications.
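
The binary Jacobi algorithm mentioned above uses only shifts, swaps and sign flips, which is what makes it attractive for add-shift hardware. A software reference (the standard textbook algorithm, not the paper's circuit):

```python
# Binary Jacobi algorithm: compute the Jacobi symbol (a/n) for odd n > 0
# using only halving, swapping and sign flips.
def jacobi(a, n):
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:                # pull out factors of two
            a //= 2
            if n % 8 in (3, 5):          # (2/n) = -1 when n = 3, 5 (mod 8)
                result = -result
        a, n = n, a                      # quadratic reciprocity step
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0       # 0 when gcd(a, n) > 1

print(jacobi(3, 7))   # -1: 3 is not a quadratic residue mod 7
```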

Proceedings ArticleDOI
05 Sep 2011
TL;DR: A unifying framework for describing and automatically implementing financial explicit finite difference procedures in reconfigurable hardware, allowing parallelised and pipelined implementations to be created from high-level mathematical expressions is presented.
Abstract: The explicit finite difference method is widely used in finance for pricing many kinds of options. Its regular computational pattern makes it an ideal candidate for acceleration using reconfigurable hardware. However, because the corresponding hardware designs must be optimised both for the specific option and for the target platform, it is challenging and time consuming to develop designs efficiently and productively. This paper presents a unifying framework for describing and automatically implementing financial explicit finite difference procedures in reconfigurable hardware, allowing parallelised and pipelined implementations to be created from high-level mathematical expressions. The proposed framework is demonstrated using three option pricing problems. Our results show that an implementation from our framework targeting a Virtex-6 device at 310MHz is more than 24 times faster than a software implementation fully optimised by the Intel compiler on a four-core Xeon CPU at 2.66GHz. In addition, the latency of the FPGA solvers is up to 90 times lower than the corresponding software solvers.
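
The regular computational pattern referred to above is a stencil: each grid point is updated from its neighbours at the previous time step. A minimal software version for a European call under Black-Scholes (grid sizes, boundary treatment and coefficients are a standard textbook scheme, assumed for illustration rather than taken from the paper):

```python
# Explicit finite-difference pricing of a European call: march the
# payoff backwards in time with a three-point stencil over the price grid.
import math

def fd_european_call(s0, k, r, sigma, t, s_steps=100, t_steps=4000):
    ds = 2.0 * s0 / s_steps              # grid covers prices up to 2*s0
    dt = t / t_steps                     # small dt keeps the scheme stable
    s_max = s_steps * ds
    v = [max(i * ds - k, 0.0) for i in range(s_steps + 1)]   # payoff at expiry
    for m in range(1, t_steps + 1):
        tau = m * dt                     # time to maturity so far
        prev = v[:]
        for i in range(1, s_steps):      # the regular stencil update
            a = 0.5 * dt * (sigma**2 * i**2 - r * i)
            b = 1.0 - dt * (sigma**2 * i**2 + r)
            c = 0.5 * dt * (sigma**2 * i**2 + r * i)
            v[i] = a * prev[i-1] + b * prev[i] + c * prev[i+1]
        v[0] = 0.0                       # call is worthless at S = 0
        v[s_steps] = s_max - k * math.exp(-r * tau)   # deep-ITM boundary
    return v[s_steps // 2]               # grid point at S = s0

print(round(fd_european_call(100, 100, 0.05, 0.2, 1.0), 2))
```

Every interior point uses the same three-multiply update, which is why the computation parallelises and pipelines so cleanly in hardware.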

Proceedings ArticleDOI
11 Sep 2011
TL;DR: The proposed framework can be used to improve the scalability of a reconfigurable cluster by involving more nodes in a single application and show high efficiency data throughput for both large and small data volumes, as well as low communication overhead.
Abstract: Computer clusters equipped with reconfigurable accelerators have shown promise in high performance computing. This paper explores novel ways of customising data communication between accelerator nodes, which is often a bottleneck when scaling up the cluster size. Based on the direct connection of high speed serial links between advanced reconfigurable devices, we develop and evaluate CusComNet, a scalable, flexible and efficient communication framework. The CusComNet framework is built around customisable, packet-based communication and supports three main types of customisation: packet protocol customisation, system-level customisation, and prioritised communication customisation. A performance model for estimating CusComNet's communication latency is proposed and demonstrated. Our framework is applied to a 16-node cluster, each node of which contains an FPGA accelerator which can be connected directly to other FPGA accelerators. The proposed framework can be used to improve the scalability of a reconfigurable cluster by involving more nodes in a single application. Performance measurements show high efficiency data throughput for both large and small data volumes, as well as low communication overhead.

Book ChapterDOI
01 Jan 2011
TL;DR: This paper describes a method of developing energy-efficient run-time reconfigurable hardware designs by systematically deactivating part of the hardware using word-length optimisation techniques and then selecting the better reconfiguration strategy: multiple-bitstream reconfiguration or component multiplexing.
Abstract: This paper describes a method of developing energy-efficient run-time reconfigurable hardware designs. The key idea is to systematically deactivate part of the hardware using word-length optimisation techniques, and then select the better reconfiguration strategy: multiple-bitstream reconfiguration or component multiplexing. When multiplexing between different parts of the circuit, it may not always be possible to gate the clock to the unwanted components in FPGAs. Different methods of achieving the same effect while minimising the area used for the control logic are investigated. A model is used to determine the conditions under which reconfiguring the bitstream is more energy-efficient than multiplexing part of the design, based on power measurements taken on 130nm and 90nm devices. Various case studies, such as ray tracing, B-splines, vector multiplication and inner product, are used to illustrate this approach.


Book ChapterDOI
01 Jan 2011
TL;DR: It is shown that post-fabrication customisation of a graphics processor can provide up to four times performance improvement for negligible area cost.
Abstract: A systematic approach to customising Homogeneous Multi-Processor (HoMP) architectures is described. The approach involves a novel design space exploration tool and a parameterisable system model. Post-fabrication customisation options for using reconfigurable logic with a HoMP are classified. The adoption of the approach in exploring pre- and post-fabrication customisation options to optimise an architecture's critical paths is then described. The approach and steps are demonstrated using the architecture of a graphics processor. We also analyse on-chip and off-chip memory access for systems with one or more processing elements (PEs), and study the impact of the number of threads per PE on the amount of off-chip memory access and the number of cycles for each output. It is shown that post-fabrication customisation of a graphics processor can provide up to four times performance improvement for negligible area cost.


Proceedings ArticleDOI
01 Dec 2011
TL;DR: Results show that the heterogeneous computing system with appropriate workload allocation provides high energy efficiency with peak value at 1.1 GFLOPs/W and reduces power consumption by 56.54%; and that workload allocation schemes are significantly different with regards to different system metrics.
Abstract: In this work, we explore heterogeneous computing hardware, including CPUs, GPUs and FPGAs, for scientific computing. We study system metrics such as throughput, energy efficiency and temperature, and formulate the problem of workload allocation among computing hardware in mathematical models with regards to the three metrics. The workload allocation approach is evaluated using Linpack on a hardware platform containing one CPU, one GPU and one FPGA. Results show that the heterogeneous computing system with appropriate workload allocation provides high energy efficiency with peak value at 1.1 GFLOPs/W and reduces power consumption by 56.54%; and that workload allocation schemes are significantly different with regards to different system metrics.
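
A toy model of the workload-allocation question: splitting work in proportion to device throughput equalises finishing times, and the summed power then determines system efficiency. The device figures below are invented placeholders, not the paper's measurements, and the real formulation is an optimisation over several metrics.

```python
# Allocate a fixed workload across heterogeneous devices.
# Each device is (throughput in GFLOPs, power in W) -- illustrative only.
devices = {"cpu": (10.0, 80.0), "gpu": (80.0, 200.0), "fpga": (30.0, 40.0)}

def balanced_split(devices):
    # work proportional to speed => all devices finish at the same time
    total = sum(gflops for gflops, _ in devices.values())
    return {name: gflops / total for name, (gflops, _) in devices.items()}

def system_efficiency(devices):
    throughput = sum(g for g, _ in devices.values())   # run concurrently
    power = sum(w for _, w in devices.values())
    return throughput / power                          # GFLOPs per watt

split = balanced_split(devices)
print(round(split["gpu"], 3))    # 0.667
```

Optimising for energy efficiency or temperature instead of throughput generally yields a different split, which is the paper's central observation.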

Proceedings ArticleDOI
27 Feb 2011
TL;DR: A comprehensive study of a systolic design for Smith-Waterman algorithm is presented, with specific focus on enhancing parallelism and on optimizing the total size of memory and circuits; in particular, efficient realizations for compressing score matrices and for reducing affine gap cost functions are developed.
Abstract: The Smith-Waterman algorithm is a key technique for comparing genetic sequences. This paper presents a comprehensive study of a systolic design for the Smith-Waterman algorithm. It is parameterized in terms of the sequence length, the amount of parallelism, and the number of FPGAs. Two methods of organizing the parallelism, the line-based and the lattice-based methods, are introduced. Our analytical treatment reveals how these two methods perform relative to peak performance when the level of parallelism varies. A novel systolic design is then described, showing how the parametric description can be effectively implemented, with specific focus on enhancing parallelism and on optimizing the total size of memory and circuits; in particular, we develop efficient realizations for compressing score matrices and for reducing affine gap cost functions. Promising results have been achieved, showing, for example, that a single XC5VLX330 FPGA at 131MHz can be three times faster than a platform with two NVIDIA GTX295 at 1242MHz.
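For reference, the recurrences that a systolic array for this problem evaluates can be written down directly. The following software sketch computes the Smith-Waterman score with affine gap costs using Gotoh's three-matrix formulation; it is a sequential illustration of the algorithm only, not the paper's hardware design, and the scoring parameters are arbitrary defaults.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap_open=2, gap_extend=1):
    """Local alignment score with affine gaps (Gotoh's recurrences).

    M[i][j]  : best score ending with a[i-1] aligned to b[j-1]
    Ix[i][j] : best score ending with a gap in b
    Iy[i][j] : best score ending with a gap in a
    """
    NEG = float("-inf")
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: scores are clamped at zero.
            M[i][j] = max(0, M[i-1][j-1] + s, Ix[i-1][j-1] + s, Iy[i-1][j-1] + s)
            Ix[i][j] = max(M[i-1][j] - gap_open, Ix[i-1][j] - gap_extend)
            Iy[i][j] = max(M[i][j-1] - gap_open, Iy[i][j-1] - gap_extend)
            best = max(best, M[i][j])
    return best

# Identical sequences score match * length; disjoint alphabets score 0.
print(smith_waterman("ACGT", "ACGT"))  # → 8
```

In a systolic realisation, the anti-diagonals of these matrices are computed in parallel, one cell per processing element, which is what makes the level of parallelism a key design parameter.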

Book ChapterDOI
27 Jun 2011
TL;DR: This chapter looks at the cache system to understand how it operates and how it is designed, then considers the main memory problem, first the on-die memory and then the conventional DRAM design.
Abstract: Memory design is the key to system design. The memory system is often the most costly (in terms of area or dies) part of the system and it largely determines the performance. Regardless of the processors and the interconnect, the application cannot be executed any faster than the memory system, which provides the instructions and the operands. Memory design involves a number of considerations. The primary consideration is the application requirements: the operating system, the size and the variability of the application processes. This largely determines the size of memory and how the memory will be addressed: real or virtual. Figure 4.1 is an outline for memory design, while Table 4.1 compares the area for different memory technologies. In this chapter we first look at the cache system to understand how it operates and how it is designed. After that we consider the main memory problem, first the on-die memory and then the conventional DRAM design. As part of the design of large memory systems we look at multiple memory modules, interleaving and memory system performance. Figure 4.2 shows the various types of memory that can be integrated into an SOC.
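A basic building block of the cache operation discussed in the chapter is how an address is split into tag, index and offset fields. The sketch below assumes a direct-mapped cache with power-of-two line size and line count; the parameter values are illustrative, not taken from the chapter.

```python
def cache_fields(addr, line_bytes=64, num_lines=256):
    """Split an address into (tag, index, offset) for a direct-mapped cache.

    Assumes line_bytes and num_lines are powers of two.
    """
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = num_lines.bit_length() - 1     # log2(num_lines)
    offset = addr & (line_bytes - 1)            # byte within the line
    index = (addr >> offset_bits) & (num_lines - 1)  # which cache line
    tag = addr >> (offset_bits + index_bits)    # identifies the memory block
    return tag, index, offset

# 64-byte lines (6 offset bits), 256 lines (8 index bits):
print(cache_fields(0x12345))  # → (4, 141, 5)
```

A lookup then compares the stored tag at the indexed line against the address tag; a mismatch is a miss and triggers a fill from the next level of the hierarchy.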

Proceedings ArticleDOI
13 Apr 2011
TL;DR: A security-aware cache targeting field-programmable gate array (FPGA) technology based on an architecture with a remapping table, which provides resilience against side-channel timing attacks and can be optimised for FPGA resources by an index decoder with content addressable memory structure.
Abstract: This paper describes a security-aware cache targeting field-programmable gate array (FPGA) technology. Our design is based on an architecture with a remapping table, which provides resilience against side-channel timing attacks. We show how this cache design can be optimised for FPGA resources by an index decoder with a content addressable memory structure, which can be customised to meet various requirements. We show, for the first time, how our security-aware cache can be included in the Leon 3 processor, and we evaluate its performance and resource usage.
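The remapping-table idea can be sketched in software: a secret permutation sits between the logical set index derived from the address and the physical set that is actually accessed, so an attacker observing hit/miss timing cannot directly infer which address bits caused a conflict. This is an illustrative model only, not the paper's FPGA design; the permutation, set count and single-line-per-set replacement here are assumptions.

```python
import random

class RemappedCache:
    """Toy direct-mapped cache whose set index passes through a secret
    remapping table (a random permutation of the physical sets)."""

    def __init__(self, num_sets=16, seed=0):
        rng = random.Random(seed)
        self.remap = list(range(num_sets))
        rng.shuffle(self.remap)          # secret logical -> physical mapping
        self.sets = [None] * num_sets    # one tag (here, full address) per set
        self.num_sets = num_sets

    def lookup(self, addr):
        logical = addr % self.num_sets       # index an attacker can predict
        physical = self.remap[logical]       # index the hardware really uses
        hit = self.sets[physical] == addr
        self.sets[physical] = addr           # fill on miss, refresh on hit
        return hit
```

Two addresses with the same logical index still conflict, but refreshing the table periodically (as a remapping scheme would) breaks the stable address-to-set correspondence that timing attacks rely on.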

Book ChapterDOI
01 Jan 2011
TL;DR: The hArtes toolchain provides (semi) automatic support to the designer for this mapping effort and allows for easy design space exploration to find the best mapping, given hardware availability and real time execution constraints.
Abstract: When targeting heterogeneous, multi-core platforms, system and application developers are not only confronted with the challenge of choosing the best hardware configuration for the application they need to map, but the application also has to be modified so that certain parts are executed on the most appropriate hardware component. The hArtes toolchain provides (semi-)automatic support to the designer for this mapping effort. A hardware platform was specifically designed for the project, consisting of an ARM processor, a DSP and an FPGA. The toolchain, targeting this platform but potentially targeting any similar system, has been tested and validated on several computationally intensive applications and resulted in substantial speedups as well as drastically reduced development times. We report speedups of up to nine times compared against a pure ARM-based execution, and mapping can be done in minutes. The toolchain thus allows for easy design space exploration to find the best mapping, given hardware availability and real-time execution constraints.