Journal ArticleDOI

A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices

TL;DR: This article presents a widely parallel, deeply pipelined hardware CG implementation targeted at modern FPGA architectures; it is particularly suited to accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block for solving higher-order systems.
Abstract: Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation, and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. In this article it is shown that through parallelization it is possible to reduce the computation time per iteration for an order-n matrix from Θ(n²) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and convergent to a constant value as matrix order increases. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation represents a significant speedup of at least an order of magnitude.
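The CG recurrence the article parallelizes can be sketched in NumPy as a software reference (this is the textbook algorithm, not the paper's hardware design); the `A @ p` matrix-vector product is the Θ(n²) step per iteration that the FPGA implementation reduces to Θ(n) clock cycles:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p                       # the Theta(n^2) matrix-vector product
        alpha = rs_old / (p @ Ap)        # exact step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:        # residual sufficiently small
            break
        p = r + (rs_new / rs_old) * p    # next A-conjugate direction
        rs_old = rs_new
    return x
```

In exact arithmetic CG terminates in at most n iterations, which is why the per-iteration cost dominates the overall complexity.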


Citations
Journal ArticleDOI
15 Apr 2015
TL;DR: This work surveys the field of reconfigurable computing, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.
Abstract: Reconfigurable architectures can bring unique capabilities to computational tasks. They offer the performance and energy efficiency of hardware with the flexibility of software. In some domains, they are the only way to achieve the required, real-time performance without fabricating custom integrated circuits. Their functionality can be upgraded and repaired during their operational lifecycle and specialized to the particular instance of a task. We survey the field of reconfigurable computing, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.

178 citations


Cites methods from "A High Throughput FPGA-Based Floati..."

  • ...Conjugate gradient uses iterative matrix-vector multiplication to solve systems of linear equations including dense matrix problems [261] and sparse problems [262]....


Proceedings ArticleDOI
12 Mar 2016
TL;DR: TABLA provides a template-based framework that generates accelerators for a class of machine learning algorithms and rigorously compares the benefits of FPGA acceleration to multi-core CPUs and many-core GPUs using real hardware measurements.
Abstract: A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a single machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.
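The fixed stochastic-gradient-descent solver that TABLA's templates implement can be illustrated with a minimal NumPy sketch for linear regression (the function name and hyperparameters here are illustrative, not part of TABLA's language or templates):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=100, seed=0):
    """Fit weights w minimizing squared error via stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):         # one training sample per update
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of the per-sample loss
            w -= lr * grad
    return w
```

Swapping in a different per-sample gradient (logistic loss, SVM hinge loss, etc.) changes the learning task while the solver loop stays fixed, which is the commonality TABLA exploits.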

158 citations

Journal ArticleDOI
TL;DR: The novel blind classification algorithm proposed in this paper exploits the cyclostationarity property of space-time block codes (STBCs) for the classification of multiple antenna systems in the presence of possible transmission impairments.
Abstract: Signal classification is important in various commercial and military applications. Multiple antenna systems complicate the signal classification problem since there is now the issue of estimating the number and configuration of transmit antennas. The novel blind classification algorithm proposed in this paper exploits the cyclostationarity property of space-time block codes (STBCs) for the classification of multiple antenna systems in the presence of possible transmission impairments. Analytical expressions for the second-order cyclic statistics used as the basis of the algorithm are derived, and the computational cost of the proposed algorithm is considered. This algorithm avoids the need for a priori knowledge of the channel coefficients, modulation, carrier phase, and timing offsets. Moreover, it does not need accurate information about the transmission data rate and carrier frequency offset. Monte Carlo simulation results demonstrate a good classification performance with low sensitivity to phase noise and channel effects, including frequency-selective fading and Doppler shift.

76 citations


Cites background from "A High Throughput FPGA-Based Floati..."

  • ...For example, field programmable gate arrays (FPGAs) can readily achieve several Gigaflops per second [30], [31] provided the potential parallelism is effectively exploited, and highly optimized FPGA implementations of the FFT are available....


Proceedings ArticleDOI
01 Apr 2017
TL;DR: A single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU and a novel compression scheme—called stochastic quantization, specifically designed for machine learning applications is presented.
Abstract: Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme—called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.
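The paper's stochastic quantization scheme is its own design; a generic unbiased stochastic-rounding sketch conveys the underlying idea of randomized low-precision compression (the grid, range, and parameters below are illustrative assumptions, not the paper's scheme):

```python
import numpy as np

def stochastic_quantize(x, levels=256, lo=-1.0, hi=1.0, seed=0):
    """Unbiased stochastic rounding to a uniform grid: E[q(x)] = x for x in [lo, hi]."""
    rng = np.random.default_rng(seed)
    step = (hi - lo) / (levels - 1)
    scaled = (np.clip(x, lo, hi) - lo) / step
    floor = np.floor(scaled)
    frac = scaled - floor
    up = rng.random(x.shape) < frac      # round up with probability = fractional part
    return lo + (floor + up) * step
```

Because the rounding is unbiased in expectation, gradient updates computed on quantized data can still converge to a model of comparable quality, which is the property such low-precision SGD designs rely on.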

70 citations

Proceedings ArticleDOI
27 Feb 2011
TL;DR: This work presents a high-throughput floating-point FPGA implementation that exploits the parallelism inherent in interior-point optimization methods and shows that by considering that the QPs come from a control formulation, it is possible to make heavy use of the sparsity in the problem to save computations and reduce memory requirements.
Abstract: Model predictive control (MPC) is an advanced industrial control technique that relies on the solution of a quadratic programming (QP) problem at every sampling instant to determine the input action required to control the current and future behaviour of a physical system. Its ability in handling large multiple input multiple output (MIMO) systems with physical constraints has led to very successful applications in slow processes, where there is sufficient time for solving the optimization problem between sampling instants. The application of MPC to faster systems, which adds the requirement of greater sampling frequencies, relies on new ways of finding faster solutions to QP problems. Field-programmable gate arrays (FPGAs) are specially well suited for this application due to the large amount of computation for a small amount of I/O. In addition, unlike a software implementation, an FPGA can provide the precise timing guarantees required for interfacing the controller to the physical system. We present a high-throughput floating-point FPGA implementation that exploits the parallelism inherent in interior-point optimization methods. It is shown that by considering that the QPs come from a control formulation, it is possible to make heavy use of the sparsity in the problem to save computations and reduce memory requirements by 75%. The implementation yields a 6.5x improvement in latency and a 51x improvement in throughput for large problems over a software implementation running on a general purpose microprocessor.
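Each interior-point iteration ultimately reduces to a structured linear solve; a minimal equality-constrained QP solved via its KKT system sketches that core kernel (this is a generic textbook formulation, not the paper's sparse MPC-specific solver):

```python
import numpy as np

def eq_qp(H, g, A, b):
    """Solve min 0.5 x^T H x + g^T x  s.t.  A x = b  via the KKT linear system.
    This linear solve is the dominant computation inside each interior-point step."""
    n, m = H.shape[0], A.shape[0]
    # KKT matrix: stationarity (H x + A^T lam = -g) and feasibility (A x = b)
    KKT = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, b])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:n], sol[n:]   # primal solution x, equality multipliers lam
```

In an MPC formulation the KKT matrix is banded and highly sparse, which is what the paper exploits to cut computation and memory by 75%.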

53 citations

References
Book
01 Jan 1983

34,729 citations

Journal ArticleDOI
TL;DR: An iterative algorithm is given for solving a system Ax=k of n linear equations in n unknowns and it is shown that this method is a special case of a very general method which also includes Gaussian elimination.
Abstract: An iterative algorithm is given for solving a system Ax=k of n linear equations in n unknowns. The solution is given in n steps. It is shown that this method is a special case of a very general method which also includes Gaussian elimination. These general algorithms are essentially algorithms for finding an n dimensional ellipsoid. Connections are made with the theory of orthogonal polynomials and continued fractions.

7,598 citations

01 Mar 1994
TL;DR: The Conjugate Gradient Method as discussed by the authors is the most prominent iterative method for solving sparse systems of linear equations and is a composite of simple, elegant ideas that almost anyone can understand.
Abstract: The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written so that even their own authors would be mystified, if they bothered to read their own writing. For this reason, an understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas that almost anyone can understand. Of course, a reader as intelligent as yourself will learn them almost effortlessly. The idea of quadratic forms is introduced and used to derive the methods of Steepest Descent, Conjugate Directions, and Conjugate Gradients. Eigenvectors are explained and used to examine the convergence of the Jacobi Method, Steepest Descent, and Conjugate Gradients. Other topics include preconditioning and the nonlinear Conjugate Gradient Method. I have taken pains to make this article easy to read. Sixty-two illustrations are provided. Dense prose is avoided. Concepts are explained in several different ways. Most equations are coupled with an intuitive interpretation.
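Steepest Descent, which the monograph uses as the stepping stone toward Conjugate Gradients, fits in a few lines (a generic sketch of the method, not code from the monograph):

```python
import numpy as np

def steepest_descent(A, b, tol=1e-10, max_iter=10000):
    """Minimize the quadratic form f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b,
    by repeatedly stepping along the residual r = b - A x (the negative gradient)."""
    x = np.zeros(len(b))
    for _ in range(max_iter):
        r = b - A @ x                      # residual = direction of steepest descent
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ A @ r)      # exact line search along r
        x += alpha * r
    return x
```

Unlike CG, successive steps here can repeat directions, so convergence degrades as the condition number of A grows; making the directions mutually A-conjugate is exactly the improvement CG introduces.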

2,535 citations


Additional excerpts

  • ...Conjugate Gradient algorithm iterates until the residual error is sufficiently small [Shewchuk 2003]....


Proceedings ArticleDOI
07 Nov 1998
TL;DR: An approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units using the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS).
Abstract: This paper describes an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time-consuming process. The work described here can help in automating much of this process. We will concentrate our efforts on the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS). In particular, the work presented here is for general matrix multiply, DGEMM. However, much of the technology and approach developed here can be applied to the other Level 3 BLAS, and the general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations.
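The cache-blocking loop structure that such autotuning generators search over can be sketched as follows (the block size below is an arbitrary placeholder; real tuners like this one find it empirically per machine, and generate low-level C or assembly rather than NumPy):

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Cache-blocked general matrix multiply: C = A @ B computed tile by tile,
    so each small update works on submatrices that fit in cache."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # accumulate one tile-by-tile product into the C tile
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C
```

The result is identical to an unblocked multiply; only the memory-access order changes, which is where the performance on deep memory hierarchies comes from.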

1,115 citations

Book
01 Jan 2007
TL;DR: In this article, a detailed introduction to the analysis and design of multiple-input multiple-output (MIMO) wireless systems is presented, and the fundamental capacity limits of MIMO systems are examined.
Abstract: Multiple-input multiple-output (MIMO) technology constitutes a breakthrough in the design of wireless communications systems, and is already at the core of several wireless standards. Exploiting multipath scattering, MIMO techniques deliver significant performance enhancements in terms of data transmission rate and interference reduction. This book is a detailed introduction to the analysis and design of MIMO wireless systems. Beginning with an overview of MIMO technology, the authors then examine the fundamental capacity limits of MIMO systems. Transmitter design, including precoding and space-time coding, is then treated in depth, and the book closes with two chapters devoted to receiver design. Written by a team of leading experts, the book blends theoretical analysis with physical insights, and highlights a range of key design challenges. It can be used as a textbook for advanced courses on wireless communications, and will also appeal to researchers and practitioners working on MIMO wireless systems.

721 citations