
Showing papers by "Wayne Luk published in 2013"


Journal Article•DOI•
01 Feb 2013
TL;DR: A novel motion parameterization scheme in polar coordinates is proposed to describe the transition of motion, thus allowing for direct manual control of the robot using standard interface devices with limited degrees of freedom.
Abstract: This paper presents a real-time control framework for a snake robot with hyper-kinematic redundancy under dynamic active constraints for minimally invasive surgery. A proximity query (PQ) formulation is proposed to compute the deviation of the robot motion from predefined anatomical constraints. The proposed method is generic and can be applied to any snake robot represented as a set of control vertices. The proposed PQ formulation is implemented on a graphic processing unit, allowing for fast updates over 1 kHz. We also demonstrate that the robot joint space can be characterized into lower dimensional space for smooth articulation. A novel motion parameterization scheme in polar coordinates is proposed to describe the transition of motion, thus allowing for direct manual control of the robot using standard interface devices with limited degrees of freedom. Under the proposed framework, the correct alignment between the visual and motor axes is ensured, and haptic guidance is provided to prevent excessive force applied to the tissue by the robot body. A resistance force is further incorporated to enhance smooth pursuit movement matched to the dynamic response and actuation limit of the robot. To demonstrate the practical value of the proposed platform with enhanced ergonomic control, detailed quantitative performance evaluation was conducted on a group of subjects performing simulated intraluminal and intracavity endoscopic tasks.

83 citations


Proceedings Article•DOI•
28 Apr 2013
TL;DR: This paper proposes a novel approach, based on reconfigurable computing technology, for accelerating short read mapping, where the positions of millions of short reads are located relative to a known reference sequence.
Abstract: Recent improvements in the throughput of next-generation DNA sequencing machines pose a great computational challenge in analysing the massive quantities of data produced. This paper proposes a novel approach, based on reconfigurable computing technology, for accelerating short read mapping, where the positions of millions of short reads are located relative to a known reference sequence. Our approach consists of two key components: an exact string matcher for the bulk of the alignment process, and an approximate string matcher for the remaining cases. We characterise interesting regions of the design space, including homogeneous, heterogeneous and run-time reconfigurable designs, and provide back-of-the-envelope estimates of the corresponding performance. We show that a particular implementation of this architecture targeting a single FPGA can be up to 293 times faster than BWA on an Intel X5650 CPU, and 134 times faster than SOAP3 on an NVIDIA GTX 580 GPU.

39 citations


Journal Article•DOI•
TL;DR: This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes advantage of bitwise xor operations and the ability to turn lookup tables (LUTs) into shift registers of varying lengths, with quality comparable to the best software generators.
Abstract: Field-programmable gate array (FPGA) optimized random number generators (RNGs) are more resource-efficient than software-optimized RNGs because they can take advantage of bitwise operations and FPGA-specific features. However, it is difficult to concisely describe FPGA-optimized RNGs, so they are not commonly used in real-world designs. This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes advantage of bitwise xor operations and the ability to turn lookup tables (LUTs) into shift registers of varying lengths. This provides a good resource-quality balance compared to previous FPGA-optimized generators, between the previous high-resource high-period LUT-FIFO RNGs and low-resource low-quality LUT-OPT RNGs, with quality comparable to the best software generators. The LUT-SR generators can also be expressed using a simple C++ algorithm contained within this paper, allowing 60 fully-specified LUT-SR RNGs with different characteristics to be embedded in this paper, backed up by an online set of very high speed integrated circuit hardware description language (VHDL) generators and test benches.
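To make the construction concrete, the following is a minimal behavioural sketch in C++ of a LUT-SR-style generator: each state bit owns a shift register (standing in for a LUT configured as a variable-length SRL), and each new bit is the XOR of taps drawn from other registers. The register lengths and taps below are made up for illustration and are not one of the 60 published tunings; consult the paper's own C++ algorithm and VHDL generators for real parameterisations.

```cpp
// Hypothetical behavioural model of a LUT-SR-style generator: each state bit
// has its own shift register (emulating a LUT configured as an SRL), and the
// new value of bit i is the XOR of a few taps drawn from other registers.
#include <cstdio>
#include <deque>
#include <vector>

struct LutSrModel {
    std::vector<std::deque<int>> sr;          // one shift register per state bit
    std::vector<std::vector<int>> taps;       // which registers feed bit i's XOR

    LutSrModel(const std::vector<int>& lengths,
               const std::vector<std::vector<int>>& xor_taps)
        : taps(xor_taps) {
        for (int len : lengths) sr.emplace_back(len, 1);  // seed: all ones
    }

    // Advance one cycle and return the new word of state bits.
    std::vector<int> step() {
        std::vector<int> out(sr.size());
        for (size_t i = 0; i < sr.size(); ++i) {
            int bit = 0;
            for (int t : taps[i]) bit ^= sr[t].back();    // XOR of tapped outputs
            out[i] = bit;
        }
        for (size_t i = 0; i < sr.size(); ++i) {          // shift every register
            sr[i].pop_back();
            sr[i].push_front(out[i]);
        }
        return out;
    }
};

int main() {
    // Toy 4-bit instance with made-up lengths and taps (not a published tuning).
    LutSrModel g({3, 5, 4, 7}, {{1, 2}, {0, 3}, {1, 3}, {0, 2}});
    for (int c = 0; c < 4; ++c) {
        for (int b : g.step()) std::printf("%d", b);
        std::printf("\n");
    }
}
```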

37 citations


Book Chapter•DOI•
25 Mar 2013
TL;DR: This paper explores the use of reconfigurable hardware to accelerate the short read mapping problem, and finds that an implementation targeting the MaxWorkstation performs considerably faster and more energy efficient than current CPU and GPU based software aligners.
Abstract: Next generation DNA sequencing machines have been improving at an exceptional rate; the subsequent analysis of the generated sequenced data has become a bottleneck in current systems. This paper explores the use of reconfigurable hardware to accelerate the short read mapping problem, where the positions of millions of short DNA sequences are located relative to a known reference sequence. The proposed design comprises an alignment processor based on a backtracking variation of the FM-index algorithm. The design represents a full solution to the short read mapping problem, capable of efficient exact and approximate alignment. We use reconfigurable hardware to accelerate the design and find that an implementation targeting the MaxWorkstation is considerably faster and more energy efficient than current CPU and GPU based software aligners.
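As background for the FM-index-based aligner, the sketch below shows the exact-match backward-search kernel that backtracking variants extend with mismatch branching. Everything is built by brute force over a toy reference for clarity; a real index uses a compact BWT and sampled occurrence tables rather than the naive structures assumed here.

```cpp
// Self-contained, naive sketch of FM-index exact backward search: the kernel
// that a backtracking aligner extends with mismatch branching. All structures
// are built by brute force for clarity, not efficiency.
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

int main() {
    std::string text = "ACGTACGT";            // toy reference (hypothetical)
    std::string ref = text + '$';             // sentinel sorts before A/C/G/T
    int n = ref.size();

    // Suffix array by sorting suffixes directly (O(n^2 log n), fine for a toy).
    std::vector<int> sa(n);
    for (int i = 0; i < n; ++i) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return ref.substr(a) < ref.substr(b);
    });

    // BWT: character preceding each suffix in suffix-array order.
    std::string bwt(n, ' ');
    for (int i = 0; i < n; ++i) bwt[i] = ref[(sa[i] + n - 1) % n];

    // C[c]: number of characters in ref strictly smaller than c.
    std::map<char, int> C;
    { std::string s = ref; std::sort(s.begin(), s.end());
      for (int i = 0; i < n; ++i) if (!C.count(s[i])) C[s[i]] = i; }

    // Occ(c, i): occurrences of c in bwt[0..i], computed by scanning.
    auto occ = [&](char c, int i) {
        int cnt = 0;
        for (int k = 0; k <= i; ++k) if (bwt[k] == c) ++cnt;
        return cnt;
    };

    // Backward search: maintain the suffix-array interval [sp, ep] of matches.
    std::string pat = "CGT";
    int sp = 0, ep = n - 1;
    for (int i = pat.size() - 1; i >= 0 && sp <= ep; --i) {
        char c = pat[i];
        sp = C[c] + occ(c, sp - 1);
        ep = C[c] + occ(c, ep) - 1;
    }
    if (sp <= ep)
        for (int i = sp; i <= ep; ++i)
            std::printf("match at reference position %d\n", sa[i]);
    else
        std::printf("no exact match\n");
}
```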

37 citations


Journal Article•DOI•
TL;DR: SPREAD is a reconfigurable architecture with a unified software/hardware thread interface and high throughput point-to-point streaming structure that enhances hardware efficiency while simplifying the development of streaming applications for partially reconfigured systems.
Abstract: Partially reconfigurable systems are promising computing platforms for streaming applications, which demand both hardware efficiency and reconfigurable flexibility. To realize the full potential of these systems, a streaming-based partially reconfigurable architecture and unified software/hardware multithreaded programming model (SPREAD) is presented in this paper. SPREAD is a reconfigurable architecture with a unified software/hardware thread interface and high throughput point-to-point streaming structure. It supports dynamic computing resource allocation, runtime software/hardware switching, and streaming-based multithreaded management at the operating system level. SPREAD is designed to provide programmers of streaming applications with a unified view of threads, allowing them to exploit thread, data, and pipeline parallelism; it enhances hardware efficiency while simplifying the development of streaming applications for partially reconfigurable systems. Experimental results targeting cryptography applications demonstrate the feasibility and superior performance of SPREAD. Moreover, the parallelized Advanced Encryption Standard (AES), Data Encryption Standard (DES), and Triple DES (3DES) hardware threads on field-programmable gate arrays show 1.61-4.59 times higher power efficiency than their implementations on state-of-the-art graphics processing units.

34 citations


Proceedings Article•DOI•
24 Oct 2013
TL;DR: A hybrid CPU-FPGA algorithm that applies single and multiple FPGAs to compute the upwind stencil for the global shallow water equations is proposed, which can perform 428 floating-point and 235 fixed-point operations per cycle.
Abstract: One of the most essential and challenging components in a climate system model is the atmospheric model. To solve the multi-physical atmospheric equations, developers have to face extremely complex stencil kernels. In this paper, we propose a hybrid CPU-FPGA algorithm that applies single and multiple FPGAs to compute the upwind stencil for the global shallow water equations. Through mixed-precision arithmetic, we manage to build a fully pipelined upwind stencil design on a single FPGA, which can perform 428 floating-point and 235 fixed-point operations per cycle. The CPU-FPGA algorithm using one Virtex-6 FPGA provides 100 times speedup over a 6-core CPU and 4 times speedup over a hybrid node with 12 CPU cores and a Fermi GPU card. The algorithm using four FPGAs provides 330 times speedup over a 6-core CPU; it is also 14 times faster and 9 times more power efficient than the hybrid CPU-GPU node.
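For context, a generic first-order upwind update for one-dimensional advection at speed c illustrates the stencil pattern being pipelined (this is a minimal textbook form, not the actual multi-dimensional upwind stencil of the global shallow water equations):

    u_i^{n+1} = u_i^{n} - \frac{c\,\Delta t}{\Delta x}\,(u_i^{n} - u_{i-1}^{n}) \quad (c > 0), \qquad
    u_i^{n+1} = u_i^{n} - \frac{c\,\Delta t}{\Delta x}\,(u_{i+1}^{n} - u_i^{n}) \quad (c < 0).

Each output point depends only on a small, fixed neighbourhood of inputs, which is what allows such kernels to be unrolled into fully pipelined, mixed-precision dataflow designs.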

34 citations


Journal Article•DOI•
TL;DR: The use of techniques inspired by aspect-oriented technology and scripting languages for defining and exploring hardware compilation strategies are described and the results show the impact of various strategies when targeting custom hardware and expose the complexities in devising these strategies, hence highlighting the productivity benefits of this approach.

19 citations


Book Chapter•DOI•
25 Mar 2013
TL;DR: A method is proposed to adapt the number of particles dynamically and to utilise the run-time reconfigurability of the FPGA for reduced power and energy consumption; the proposed adaptive particle filter can reduce computation time by up to 99%.
Abstract: This paper presents a heterogeneous reconfigurable system for real-time applications applying particle filters. The system consists of an FPGA and a multi-threaded CPU. We propose a method to adapt the number of particles dynamically and utilise the run-time reconfigurability of the FPGA for reduced power and energy consumption. An application is developed which involves simultaneous mobile robot localisation and people tracking. The proposed adaptive particle filter is shown to reduce computation time by up to 99%. Using run-time reconfiguration, we achieve 34% reduction in idle power and save 26-34% of system energy. Our proposed system is up to 7.39 times faster and 3.65 times more energy efficient than the Intel Xeon X5650 CPU with 12 threads, and 1.3 times faster and 2.13 times more energy efficient than an NVIDIA Tesla C2070 GPU.
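As background (the paper's specific adaptation rule is not reproduced in the abstract), one common criterion for deciding how many particles are needed is the effective sample size computed from the normalised importance weights:

    N_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{N} w_i^{2}}.

A consistently high N_{\mathrm{eff}} suggests the particle count can be reduced, while a low value calls for more particles; this is the kind of run-time decision that pairs naturally with reconfiguring the FPGA to a smaller, lower-power datapath.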

19 citations


Proceedings Article•DOI•
Ce Guo1, Wayne Luk1•
24 Oct 2013
TL;DR: This work presents a log-likelihood evaluation strategy which is suitable for hardware acceleration, and designs and optimises a pipelined engine based on this strategy, which is shown to be up to 72 times faster than a single-core CPU and 10 times faster than an 8-core CPU.
Abstract: Hawkes processes are point processes that can be used to build probabilistic models to describe and predict occurrence patterns of random events. They are widely used in high-frequency trading, seismic analysis and neuroscience. A critical numerical calculation in Hawkes process models is parameter estimation, which is used to fit a Hawkes process model to a data set. The parameter estimation problem can be solved by searching for a parameter set that maximises the log-likelihood. A core operation of this search process, the log-likelihood evaluation, is computationally demanding if the number of data points is large. To accelerate the computation, we present a log-likelihood evaluation strategy which is suitable for hardware acceleration. We then design and optimise a pipelined engine based on our proposed strategy. In the experiments, an FPGA-based implementation of the proposed engine is shown to be up to 72 times faster than a single-core CPU, and 10 times faster than an 8-core CPU.
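For reference, assuming the widely used exponential kernel (the abstract does not restate the paper's kernel choice), a Hawkes process has intensity \lambda(t) = \mu + \alpha \sum_{t_j < t} e^{-\beta (t - t_j)}, and the log-likelihood of events t_1 < \dots < t_N observed on [0, T] is

    \ell(\mu, \alpha, \beta) = \sum_{i=1}^{N} \log\Big( \mu + \alpha \sum_{j < i} e^{-\beta (t_i - t_j)} \Big) - \mu T - \frac{\alpha}{\beta} \sum_{i=1}^{N} \Big( 1 - e^{-\beta (T - t_i)} \Big).

Evaluated naively, the inner sum makes each evaluation O(N^2), which is why a pipelined evaluation engine pays off for large data sets; for this particular kernel a well-known recursion can also reduce the cost to O(N).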

16 citations


Proceedings Article•DOI•
Xinyu Niu1, Thomas C. P. Chau1, Qiwei Jin1, Wayne Luk1, Qiang Liu2 •
28 Apr 2013
TL;DR: Reconfiguration Data Flow Graph is introduced, a hierarchical graph structure enabling reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and run-time solution generation.
Abstract: A design approach is proposed to automatically identify and exploit run-time reconfiguration opportunities while optimising resource utilisation. We introduce Reconfiguration Data Flow Graph, a hierarchical graph structure enabling reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and run-time solution generation. Three applications, based on barrier option pricing, particle filter, and reverse time migration are used in evaluating the proposed approach. The run-time solutions approximate the theoretical performance by eliminating idle functions, and are 1.31 to 2.19 times faster than optimised static designs. FPGA designs developed with the proposed approach are up to 28.8 times faster than optimised CPU reference designs and 1.55 times faster than optimised GPU designs.

13 citations


Proceedings Article•DOI•
24 Oct 2013
TL;DR: A scalable communication model to schedule communication operations based on available resources and algorithm properties is proposed to solve the problem of scalability of stencil algorithms in large-scale clusters.
Abstract: Stencil-based algorithms are known to be computationally intensive and are used in many scientific applications. The scalability of stencil algorithms in large-scale clusters is limited by data dependencies between distributed workloads. This paper proposes a scalable communication model to schedule communication operations based on available resources and algorithm properties. Experimental results from the Maxeler MPC-C500 computing system with four Virtex-6 SX475T FPGAs demonstrate linear speedup.

Proceedings Article•DOI•
01 Dec 2013
TL;DR: This work proposes a new general approach for accelerating suffix-trie based short read alignment methods using reconfigurable hardware and finds that in this particular implementation the alignment time can be up to 14.7 and 18.1 times faster than SOAP2 and BWA run on dual Intel X5650 CPUs.
Abstract: Recent trends in the cost and demand of next generation DNA sequencing (NGS) have revealed a great computational challenge in analysing the massive quantities of sequenced data produced. Given that the projected increase in sequenced data far outstrips Moore's Law, the current technologies used to handle the data are likely to become insufficient. This paper explores the use of reconfigurable hardware in accelerating short read alignment. In this application, the positions of millions of short DNA sequences (called reads) are located in a known reference genome. This work proposes a new general approach for accelerating suffix-trie based short read alignment methods using reconfigurable hardware. In the proposed approach, specialised filters are designed to align short reads to a reference genome with a specific edit distance. The filters are arranged in a pipeline according to increasing edit distance, where short reads unable to be aligned by a given filter are forwarded to the next filter in the pipeline for further processing. Run-time reconfiguration is used to fully populate an accelerator device with each filter in the pipeline in turn. In our implementation a single FPGA is populated with specialised filters based on a novel bidirectional backtracking version of the FM-index, and it is found that in this particular implementation the alignment time can be up to 14.7 and 18.1 times faster than SOAP2 and BWA run on dual Intel X5650 CPUs.

Proceedings Article•DOI•
01 Oct 2013
TL;DR: The Forward Financial Framework allows the computational finance problem specification to be captured precisely yet succinctly, then automatically creates efficient implementations for heterogeneous platforms, utilising both multi-core CPUs and FPGAs.
Abstract: This paper presents the Forward Financial Framework (F3), an application framework for describing and implementing forward looking financial computations on high performance, heterogeneous platforms. F3 allows the computational finance problem specification to be captured precisely yet succinctly, then automatically creates efficient implementations for heterogeneous platforms, utilising both multi-core CPUs and FPGAs. The automatic mapping of a high-level problem description to a low-level heterogeneous implementation is possible due to the domain-specific knowledge which is built in F3, along with a software architecture that allows for additional domain knowledge and rules to be added to the framework. Currently the system is able to utilise domain-knowledge of the run-time characteristics of pricing tasks to partition pricing problems and allocate them to appropriate compute resources, and to exploit relationships between financial instruments to balance computation against communication. The versatility of the framework is demonstrated using a benchmark of option pricing problems, where F3 achieves comparable speed and energy efficiency to external manual implementations. Further, the domain-knowledge guided partitioning scheme suggests a partitioning of subtasks that is 13% faster than the average, while exploiting domain dependencies to reduce redundant computations results in an average gain in efficiency of 27%.

Journal Article•DOI•
TL;DR: The new method provides a ten times increase in performance over the fastest existing field-programmable gate array generation method, and also provides a five times improvement in performance per resource over the most efficient existing method.
Abstract: The multivariate Gaussian distribution is used to model random processes with distinct pair-wise correlations, such as stock prices that tend to rise and fall together. Multivariate Gaussian vectors with length n are usually produced by first generating a vector of n independent Gaussian samples, then multiplying with a correlation inducing matrix requiring O(n²) multiplications. This paper presents a method of generating vectors directly from the uniform distribution, removing the need for an expensive scalar Gaussian generator, and eliminating the need for any multipliers. The method relies only on small read-only memories and adders, and so can be implemented using only logic resources (lookup-tables and registers), saving multipliers, and block-memory resources for the numerical simulation that the multivariate generator is driving. The new method provides a ten times increase in performance (vectors/second) over the fastest existing field-programmable gate array generation method, and also provides a five times improvement in performance per resource over the most efficient existing method. Using this method, a single 400-MHz Virtex-5 FPGA can generate vectors ten times faster than an optimized implementation on a 1.2-GHz graphics processing unit, and a hundred times faster than vectorized software on a general purpose quad core 2.2-GHz processor.
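The conventional approach that this method displaces can be written compactly: with target covariance \Sigma and a correlation-inducing matrix L satisfying L L^{T} = \Sigma (for example from a Cholesky factorisation),

    x = \mu + L z, \qquad z \sim \mathcal{N}(0, I_n),

so each output vector costs O(n²) multiply-accumulates on top of n scalar Gaussian samples; these are exactly the multipliers and scalar Gaussian generator that the table-and-adder scheme above removes.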

Journal Article•DOI•
01 Jul 2013
TL;DR: This paper presents a parallel search parallel move approach to parallelise neighbourhood search algorithms on many-core platforms and develops and implements a parallel simulated annealing algorithm for solving the travelling salesman problem using an NVIDIA Tesla C2050 GPU platform.
Abstract: This paper presents a parallel search parallel move approach to parallelise neighbourhood search algorithms on many-core platforms. In this approach, a large number of searches are run concurrently and coordinated periodically. Iteratively, each search generates and evaluates multiple moves in parallel. The proposed approach can fully utilise the computing capability of many-core platforms under various platform specific constraints. A parallel simulated annealing algorithm for solving the travelling salesman problem is developed using the parallel search parallel move scheme and implemented on an NVIDIA Tesla C2050 GPU platform. We evaluate the performance of our approach against a multi-threaded CPU implementation on a server containing two Intel Xeon X5650 CPUs (12 cores in total). The experimental results of 20 benchmark problems show that the GPU implementation achieves 99 times speedup on average in solution space exploration speed. In terms of effectiveness, the GPU implementation is capable of finding good solutions 39.5 times faster or with 21.7% solution quality improvement given the same searching time.
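The sketch below illustrates the parallel search, parallel move structure on a toy one-dimensional objective rather than the TSP kernel used in the paper: several searches run side by side, each evaluates a batch of candidate moves per iteration, and all searches are periodically re-seeded from the best solution found so far. The parameter values and cooling schedule are arbitrary choices for illustration.

```cpp
// Structural sketch of the parallel-search, parallel-move scheme on a toy
// objective: S searches each evaluate M candidate moves per iteration, and
// every SYNC iterations all searches are re-seeded from the best-so-far.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

double objective(double x) { return (x - 3.0) * (x - 3.0) + std::sin(5.0 * x); }

int main() {
    const int S = 8, M = 4, ITER = 2000, SYNC = 100;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::normal_distribution<double> step(0.0, 0.5);

    std::vector<double> x(S);
    for (auto& xi : x) xi = 10.0 * u01(rng) - 5.0;    // random starting points
    double best = x[0], best_f = objective(best);
    double temp = 1.0;

    for (int it = 0; it < ITER; ++it, temp *= 0.999) {
        for (int s = 0; s < S; ++s) {                 // searches run concurrently on a GPU
            double cur_f = objective(x[s]);
            for (int m = 0; m < M; ++m) {             // moves evaluated in parallel
                double cand = x[s] + step(rng);
                double cand_f = objective(cand);
                double accept = std::exp((cur_f - cand_f) / temp);  // Metropolis rule
                if (cand_f < cur_f || u01(rng) < accept) { x[s] = cand; cur_f = cand_f; }
            }
            if (cur_f < best_f) { best_f = cur_f; best = x[s]; }
        }
        if ((it + 1) % SYNC == 0)                     // periodic coordination
            for (auto& xi : x) xi = best;
    }
    std::printf("best x = %.4f, f = %.4f\n", best, best_f);
}
```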

Proceedings Article•DOI•
05 Jun 2013
TL;DR: This paper proposes a novel hardware compilation approach targeting dataflow designs based on aspect-oriented programming to decouple design development from design optimisation, thus improving portability and developer productivity while enabling automated exploration of design trade-offs to enhance performance.
Abstract: This paper proposes a novel hardware compilation approach targeting dataflow designs. This approach is based on aspect-oriented programming to decouple design development from design optimisation, thus improving portability and developer productivity while enabling automated exploration of design trade-offs to enhance performance. We introduce FAST, a language for specifying dataflow designs that supports our approach. Optimisation strategies for the generated designs are specified in FAST, making use of facilities in the domain-specific aspect-oriented language, LARA. Our approach is demonstrated by implementing various seismic imaging designs for Reverse-Time Migration (RTM), which have performance comparable to state-of-the-art FPGA implementations while being produced with improved developer productivity.

Proceedings Article•DOI•
01 Dec 2013
TL;DR: Experimental results show that high throughput and significant resource utilisation can be achieved with Dynamic Stencil designs, which can dynamically scale onto nodes that become available during execution.
Abstract: Computing nodes in reconfigurable clusters are occupied and released by applications during their execution. At compile time, application developers are not aware of the amount of resources available at run time. Dynamic Stencil is an approach that optimises stencil applications by constructing scalable designs which can adapt to available run-time resources in a reconfigurable cluster. This approach has three stages: compile-time optimisation, run-time initialisation, and run-time scaling, and can be used in developing effective servers for stencil computation. Reverse-Time Migration, a high-performance stencil application, is developed with the proposed approach. Experimental results show that high throughput and significant resource utilisation can be achieved with Dynamic Stencil designs, which can dynamically scale onto nodes that become available during execution. When statically optimised and initialised, the Dynamic Stencil design is 1.8 to 88 times faster and 1.7 to 92 times more power efficient than reference CPU, GPU, MaxGenFD, Blue Gene/P, Blue Gene/Q and Cray XK6 designs; when dynamically scaled, resource utilisation of the design reaches 91%, which is 1.8 to 2.3 times higher than that of its static counterpart.

Book Chapter•DOI•
25 Mar 2013
TL;DR: A novel technique is presented that uses meta-heuristics and machine learning to automate the optimization of design parameters for reconfigurable designs; the number of benchmark evaluations can be reduced by up to 85% compared to previously performed manual optimization.
Abstract: This paper presents a novel technique that uses meta-heuristics and machine learning to automate the optimization of design parameters for reconfigurable designs. Traditionally, such an optimization involves manual application analysis as well as model and parameter space exploration tool creation. We develop a Machine Learning Optimizer (MLO) to automate this process. From a number of benchmark executions, we automatically derive the characteristics of the parameter space and create a surrogate fitness function through regression and classification. Based on this surrogate model, design parameters are optimized with meta-heuristics. We evaluate our approach using two case studies, showing that the number of benchmark evaluations can be reduced by up to 85% compared to previously performed manual optimization.

Proceedings Article•DOI•
01 Dec 2013
TL;DR: This paper derives a PQ formulation which can support non-convex objects represented by meshes or point clouds and optimises the proposed PQ for reconfigurable hardware by function transformation and reduced precision, resulting in a novel data structure and memory architecture for data streaming while maintaining the accuracy of results.
Abstract: Proximity Query (PQ) is a process to calculate the relative placement of objects. It is a critical task for many applications such as robot motion planning, but it is often too computationally demanding for real-time applications, particularly those involving human-robot collaborative control. This paper derives a PQ formulation which can support non-convex objects represented by meshes or point clouds. We optimise the proposed PQ for reconfigurable hardware by function transformation and reduced precision, resulting in a novel data structure and memory architecture for data streaming while maintaining the accuracy of results. Run-time reconfiguration is adopted for dynamic precision optimisation. Experimental results show that our optimised PQ implementation on a reconfigurable platform with four FPGAs is 58 times faster than an optimised CPU implementation with 12 cores, 9 times faster than a GPU, and 3 times faster than a double precision implementation with four FPGAs.
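A minimal sketch of the baseline computation such an accelerator speeds up is the brute-force minimum pairwise distance between two point sets (hypothetical data below; the paper's optimised formulation, reduced-precision streaming and mesh support are not shown):

```cpp
// Brute-force proximity query: smallest distance between any pair of points
// drawn from the robot's control vertices and the constraint surface samples.
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <cstdio>
#include <vector>

struct Point { float x, y, z; };

float proximity(const std::vector<Point>& a, const std::vector<Point>& b) {
    float best = FLT_MAX;
    for (const Point& p : a)
        for (const Point& q : b) {
            float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
            best = std::min(best, dx * dx + dy * dy + dz * dz);
        }
    return std::sqrt(best);   // defer the square root to the very end
}

int main() {
    std::vector<Point> robot = {{0, 0, 0}, {0.5f, 0.1f, 0}, {1, 0.3f, 0}};   // made-up vertices
    std::vector<Point> tissue = {{1, 1, 0}, {2, 1, 0.5f}};                   // made-up samples
    std::printf("clearance = %.3f\n", proximity(robot, tissue));
}
```

The regular pairwise structure of this kernel streams naturally through deep FPGA pipelines, which is why the distance computation rather than the surrounding control logic dominates the acceleration effort.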

Proceedings Article•DOI•
01 Jan 2013
TL;DR: The new method is shown to have a 98.5% computational time saving over that of a previous sequential implementation, with no degradation in path quality, and is enough to allow real-time implementation.
Abstract: This paper presents the parallelisation of a Sequential Monte Carlo algorithm, and the associated changes required when applied to the problem of conflict resolution and aircraft trajectory control in air traffic management. The target problem is non-linear, constrained, non-convex and multi-agent. The new method is shown to have a 98.5% computational time saving over that of a previous sequential implementation, with no degradation in path quality. The computation saving is enough to allow real-time implementation.

Journal Article•DOI•
TL;DR: A novel mixed integer linear programming formulation is used to assign code sections from parallel tasks to shared computational components, providing the optimal trade-off between acceleration from component specialisation and serialisation delay to achieve faster execution times.

Proceedings Article•DOI•
01 Dec 2013
TL;DR: This work presents a reconfigurable accelerated approach for market feed arbitration operating at the network level, models multiple-core arbitration, and explores the scalability and performance improvements within and between cores.
Abstract: Messages are transmitted from financial exchanges to update their members about changes in the market. As UDP packets are used for message transmission, members subscribe to two identical message feeds from the exchange to lower the risk of message loss or delay. As financial trades can be time-sensitive, low-latency arbitration between these market data feeds is of particular importance. Members must either provide generic arbitration for all of their financial applications, increasing latency, or arbitrate within each application, which wastes resources and scales poorly. We present a reconfigurable accelerated approach for market feed arbitration operating at the network level. Multiple arbitrators can operate within a single FPGA to output customised feeds to downstream financial applications. Application-specific customisations are supported by each core, allowing different market feed messaging protocols, windowing operations and message buffering parameters. We model multiple-core arbitration and explore the scalability and performance improvements within and between cores. We demonstrate our design within a Xilinx Virtex-6 FPGA using the NASDAQ TotalView-ITCH 4.1 messaging standard. Our implementation operates at 16Gbps throughput, and with resource sharing, supports 12 independent cores, 33% more than simple core replication. A 56ns (7 clock cycles) windowing latency is achieved, 2.6 times lower than a hardware-accelerated CPU approach.
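The behavioural sketch below captures the core A/B arbitration rule: forward the first copy of each sequence number seen on either feed and drop the later duplicate. ITCH 4.1 framing, the configurable windowing and buffering, and the gap recovery handled by the hardware cores are deliberately omitted, and the message interface is a made-up one for illustration.

```cpp
// Behavioural sketch of A/B feed arbitration: forward the first copy of each
// sequence number seen on either feed and drop the later duplicate.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

struct Arbiter {
    uint64_t next_expected = 1;                 // next sequence number to emit in order
    std::unordered_set<uint64_t> seen_ahead;    // out-of-order copies already forwarded

    void on_message(char feed, uint64_t seq) {
        if (seq < next_expected || seen_ahead.count(seq)) {
            std::printf("feed %c seq %llu: duplicate, dropped\n", feed, (unsigned long long)seq);
            return;
        }
        std::printf("feed %c seq %llu: forwarded\n", feed, (unsigned long long)seq);
        if (seq == next_expected) {
            ++next_expected;
            while (seen_ahead.erase(next_expected)) ++next_expected;  // close any gap
        } else {
            seen_ahead.insert(seq);             // arrived ahead of a gap
        }
    }
};

int main() {
    Arbiter arb;
    // Interleaved arrivals from feeds A and B; B's copy of 3 is lost entirely.
    arb.on_message('A', 1);
    arb.on_message('B', 1);
    arb.on_message('B', 2);
    arb.on_message('A', 2);
    arb.on_message('B', 4);   // 3 missing on B, arrives later on A
    arb.on_message('A', 3);
    arb.on_message('A', 4);
}
```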

Proceedings Article•DOI•
19 Aug 2013
TL;DR: This work was supported by EPSRC (Engineering and Physical Sciences Research Council - UK) Grant No.
Abstract: This work was supported by EPSRC (Engineering and Physical Sciences Research Council - UK) Grant No. EP/G066477/1

Proceedings Article•DOI•
Xinyu Niu1, Thomas C. P. Chau1, Qiwei Jin1, Wayne Luk1, Qiang Liu2 •
11 Feb 2013
TL;DR: Configuration Data Flow Graph is introduced, a hierarchical graph structure enabling reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and run-time solution generation; the resulting run-time solutions approximate the theoretical performance by eliminating idle functions.
Abstract: A design approach is proposed to automatically identify and exploit run-time reconfiguration opportunities while optimising resource utilisation. We introduce Configuration Data Flow Graph, a hierarchical graph structure enabling reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and run-time solution generation. Three applications, based on barrier option pricing, particle filter, and reverse time migration are used in evaluating the proposed approach. The run-time solutions approximate the theoretical performance by eliminating idle functions, and are 1.61 to 2.19 times faster than optimised static designs. FPGA designs developed with the proposed approach are up to 28.8 times faster than optimised CPU reference designs and 1.55 times faster than optimised GPU designs.

Proceedings Article•DOI•
28 Apr 2013
TL;DR: A Dataflow Engine (DFE) design to accelerate the GCM computation through four steps of optimization: recomposing the algorithm to be pipeline-friendly, removing unnecessary computation, sharing common computing results, and reducing the computing precision while maintaining the same level of accuracy for the computation results.
Abstract: The Gaussian Copula Model (GCM) plays an important role in the state-of-the-art financial analysis field for modeling the dependence of financial assets. However, the existing implementations of GCM are all computationally demanding and time-consuming. In this paper, we propose a Dataflow Engine (DFE) design to accelerate the GCM computation. Specifically, a commonly used CPU-friendly GCM algorithm is converted into a fully-pipelined dataflow graph through four steps of optimization: recomposing the algorithm to be pipeline-friendly, removing unnecessary computation, sharing common computing results, and reducing the computing precision while maintaining the same level of accuracy for the computation results. The performance of the proposed DFE design is compared with three CPU-based implementations that are well-optimized. Experimental results show that our DFE solution not only generates fairly accurate results, but also achieves a maximum of 467x speedup over a single-thread CPU-based solution, 120x speedup over a multi-thread CPU-based solution, and 47x speedup over an MPI-based solution.
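As background, the Gaussian copula couples marginal uniforms u_1, \dots, u_n through a correlation matrix \Sigma via the multivariate normal CDF \Phi_{\Sigma} and the standard normal quantile \Phi^{-1}:

    C_{\Sigma}(u_1, \dots, u_n) = \Phi_{\Sigma}\big( \Phi^{-1}(u_1), \dots, \Phi^{-1}(u_n) \big).

The repeated quantile and CDF evaluations across many scenarios are what make the model computationally demanding, and their regular structure maps well onto the deep, fixed pipelines of a dataflow engine; the exact GCM algorithm pipelined in the paper is not reproduced here.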

Book Chapter•DOI•
23 Sep 2013
TL;DR: This paper introduces a programming framework based on the theory of session types for safe and scalable parallel designs and outlines a proposal to integrate session programming with heterogeneous systems for efficient and communication-safe parallel applications by a combination of code generation and type checking.
Abstract: This paper introduces a programming framework based on the theory of session types for safe and scalable parallel designs. Session-based languages can offer a clear and tractable framework to describe communications between parallel components and guarantee communication-safety and deadlock-freedom by compile-time type checking and parallel MPI code generation. Many representative communication topologies such as ring or scatter-gather can be programmed and verified in session-based programming languages. We use a case study involving N-body simulation, dense and sparse matrix multiplication to illustrate the session-based programming style. Finally, we outline a proposal to integrate session programming with heterogeneous systems for efficient and communication-safe parallel applications by a combination of code generation and type checking.

Proceedings Article•DOI•
Qingyu Liu1, Yuchun Ma1, Yu Wang1, Wayne Luk2, Jinian Bian1 •
01 Dec 2013
TL;DR: A novel prior estimator called 3D-reconvergence is proposed to evaluate the wire length of netlists in 3D FPGA designs, and the proposed partitioning approach is shown to lead to better physical layout results.
Abstract: In 3D FPGA designs, the circuit elements are distributed among multiple layers. Therefore, the partition strategies will influence the optimization of the entire design. Without the layout information, it is quite difficult to evaluate the effect of partitioning before placement. As a prior estimation model, re-convergence has proven effective for estimating wire length before placement in 2D FPGA designs. However, when it comes to 3D FPGA, the traditional method is no longer applicable due to the change of routing architecture. In this paper, we propose a novel prior estimator called 3D-reconvergence to evaluate the wire length of netlists in 3D FPGA designs. A reconvergence-aware layer partition (RALP) algorithm for 3D FPGA design is also proposed. Experimental results show that our partitioning approach leads to better physical layout results. Compared with the traditional min-cut based partitioning approach, the design flow with RALP obtains better routing results, reducing wire length by 7.06% and delay by 4.86% for 2-layer designs, and wire length by 4.71% and delay by 4.73% for 3-layer designs.

Proceedings Article•DOI•
02 Mar 2013
TL;DR: This paper presents an aspect-oriented approach supported by a tool chain that deals with functional and non-functional requirements in an integrated manner and discusses how the approach can be applied to development of safety-critical systems and provides experimental results.
Abstract: The development of avionics systems is typically a tedious and cumbersome process. In addition to the required functions, developers must consider various and often conflicting non-functional requirements such as safety, performance, and energy efficiency. Certainly, an integrated approach with a seamless design flow that is capable of requirements modelling and supporting refinement down to an actual implementation in a traceable way may lead to a significant acceleration of development cycles. This paper presents an aspect-oriented approach supported by a toolchain that deals with functional and non-functional requirements in an integrated manner. It also discusses how the approach can be applied to the development of safety-critical systems and provides experimental results.

Proceedings Article•DOI•
Tim Todman1, Wayne Luk1•
24 Oct 2013
TL;DR: This work includes an abstract approach to adding assertions and exceptions to a design, a concrete implementation for Maxeler streaming designs, and an evaluation that shows low overhead for adding exceptions to the design.
Abstract: We present an approach to enable run-time, in-circuit assertions and exceptions in reconfigurable hardware designs. Static, compile-time checking, including formal verification, can catch many errors before a reconfigurable design is implemented. However, many other errors cannot be caught by static approaches, including those due to run-time data. Our approach allows users to add run-time assertions and exceptions to a design, giving multiple ways to handle run-time errors. Our work includes an abstract approach to adding assertions and exceptions to a design, a concrete implementation for Maxeler streaming designs, and an evaluation. Results show low overhead for adding exceptions to a design.

Proceedings Article•DOI•
Ce Guo1, Wayne Luk1•
05 Jun 2013
TL;DR: A novel pipeline-friendly HAC estimation algorithm is described, derived from a mathematical specification by applying transformations to eliminate conditionals, parallelize arithmetic, and promote data reuse in computation; the resulting architecture is shown to be efficient and scalable from both theoretical and empirical perspectives.
Abstract: Heteroskedasticity and autocorrelation consistent (HAC) covariance matrix estimation, or HAC estimation in short, is one of the most important techniques in time series analysis and forecasting. It serves as a powerful analytical tool for hypothesis testing and model verification. However, HAC estimation for long and high-dimensional time series is computationally expensive. This paper describes a novel pipeline-friendly HAC estimation algorithm derived from a mathematical specification, by applying transformations to eliminate conditionals, to parallelize arithmetic, and to promote data reuse in computation. We then develop a fully-pipelined hardware architecture based on the proposed algorithm. This architecture is shown to be efficient and scalable from both theoretical and empirical perspectives. Experimental results show that an FPGA-based implementation of the proposed architecture is up to 111 times faster than an optimised CPU implementation with one core, and 14 times faster than a CPU with eight cores.
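For context, a widely used HAC estimator of the long-run covariance of a zero-mean series x_t is the Newey-West form with Bartlett weights (shown only as background; the precise estimator specification pipelined in the paper is not restated in the abstract):

    \hat{S} = \hat{\Gamma}_0 + \sum_{j=1}^{L} \Big(1 - \frac{j}{L+1}\Big) \big( \hat{\Gamma}_j + \hat{\Gamma}_j^{T} \big), \qquad \hat{\Gamma}_j = \frac{1}{T} \sum_{t=j+1}^{T} x_t\, x_{t-j}^{T}.

The nested sums over lags, time steps and vector dimensions explain why the computation becomes expensive for long, high-dimensional series, and they also expose the regular arithmetic that a fully-pipelined architecture can exploit.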