Journal ArticleDOI

A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices

TL;DR: This article presents a widely parallel, deeply pipelined hardware CG implementation targeted at modern FPGA architectures; it is particularly suited to accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block for solving higher-order systems.
Abstract: Recent developments in the capacity of modern Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation, and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this article we present a widely parallel and deeply pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. In this article it is shown that through parallelization it is possible to reduce the computation time per iteration for an order-n matrix from Θ(n²) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and convergent to a constant value as matrix order increases. Post place-and-route results on a readily available VirtexII-6000 demonstrate sustained performance of 5 GFlops, and results on a Virtex5-330 indicate sustained performance of 35 GFlops. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation represents a significant speedup of at least an order of magnitude.
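The CG recurrence the article parallelizes can be sketched in NumPy as a software reference (this is the textbook algorithm, not the paper's hardware design); the `A @ p` matrix-vector product is the Θ(n²) step per iteration that the FPGA implementation reduces to Θ(n) clock cycles:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p                       # the Theta(n^2) matrix-vector product
        alpha = rs_old / (p @ Ap)        # exact step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:        # residual sufficiently small
            break
        p = r + (rs_new / rs_old) * p    # next A-conjugate direction
        rs_old = rs_new
    return x
```

In exact arithmetic CG terminates in at most n iterations, which is why the per-iteration cost dominates the overall complexity.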


Citations
Journal ArticleDOI
15 Apr 2015
TL;DR: This work surveys the field of reconfigurable computing, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.
Abstract: Reconfigurable architectures can bring unique capabilities to computational tasks. They offer the performance and energy efficiency of hardware with the flexibility of software. In some domains, they are the only way to achieve the required, real-time performance without fabricating custom integrated circuits. Their functionality can be upgraded and repaired during their operational lifecycle and specialized to the particular instance of a task. We survey the field of reconfigurable computing, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.

178 citations


Cites methods from "A High Throughput FPGA-Based Floati..."

  • ...Conjugate gradient uses iterative matrix-vector multiplication to solve systems of linear equations including dense matrix problems [261] and sparse problems [262]....


Proceedings ArticleDOI
12 Mar 2016
TL;DR: TABLA provides a template-based framework that generates accelerators for a class of machine learning algorithms and rigorously compares the benefits of FPGA acceleration to multi-core CPUs and many-core GPUs using real hardware measurements.
Abstract: A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a single machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.
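The fixed stochastic-gradient-descent solver that TABLA's templates implement can be illustrated with a minimal NumPy sketch for linear regression (the function name and hyperparameters here are illustrative, not part of TABLA's language or templates):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=100, seed=0):
    """Fit weights w minimizing squared error via stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):         # one training sample per update
            grad = (X[i] @ w - y[i]) * X[i]  # gradient of the per-sample loss
            w -= lr * grad
    return w
```

Swapping in a different per-sample gradient (logistic loss, SVM hinge loss, etc.) changes the learning task while the solver loop stays fixed, which is the commonality TABLA exploits.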

158 citations

Journal ArticleDOI
TL;DR: The novel blind classification algorithm proposed in this paper exploits the cyclostationarity property of space-time block codes (STBCs) for the classification of multiple antenna systems in the presence of possible transmission impairments.
Abstract: Signal classification is important in various commercial and military applications. Multiple antenna systems complicate the signal classification problem since there is now the issue of estimating the number and configuration of transmit antennas. The novel blind classification algorithm proposed in this paper exploits the cyclostationarity property of space-time block codes (STBCs) for the classification of multiple antenna systems in the presence of possible transmission impairments. Analytical expressions for the second-order cyclic statistics used as the basis of the algorithm are derived, and the computational cost of the proposed algorithm is considered. This algorithm avoids the need for a priori knowledge of the channel coefficients, modulation, carrier phase, and timing offsets. Moreover, it does not need accurate information about the transmission data rate and carrier frequency offset. Monte Carlo simulation results demonstrate a good classification performance with low sensitivity to phase noise and channel effects, including frequency-selective fading and Doppler shift.

76 citations


Cites background from "A High Throughput FPGA-Based Floati..."

  • ...For example, field programmable gate arrays (FPGAs) can readily achieve several Gigaflops per second [30], [31] provided the potential parallelism is effectively exploited, and highly optimized FPGA implementations of the FFT are available....


Proceedings ArticleDOI
01 Apr 2017
TL;DR: A single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU and a novel compression scheme—called stochastic quantization, specifically designed for machine learning applications is presented.
Abstract: Stochastic gradient descent (SGD) is a commonly used algorithm for training linear machine learning models. Based on vector algebra, it benefits from the inherent parallelism available in an FPGA. In this paper, we first present a single-precision floating-point SGD implementation on an FPGA that provides similar performance as a 10-core CPU. We then adapt the design to make it capable of processing low-precision data. The low-precision data is obtained from a novel compression scheme—called stochastic quantization, specifically designed for machine learning applications. We test both full-precision and low-precision designs on various regression and classification data sets. We achieve up to an order of magnitude training speedup when using low-precision data compared to a full-precision SGD on the same FPGA and a state-of-the-art multi-core solution, while maintaining the quality of training. We open source the designs presented in this paper.
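The paper's stochastic quantization scheme is its own design; a generic unbiased stochastic-rounding sketch conveys the underlying idea of randomized low-precision compression (the grid, range, and parameters below are illustrative assumptions, not the paper's scheme):

```python
import numpy as np

def stochastic_quantize(x, levels=256, lo=-1.0, hi=1.0, seed=0):
    """Unbiased stochastic rounding to a uniform grid: E[q(x)] = x for x in [lo, hi]."""
    rng = np.random.default_rng(seed)
    step = (hi - lo) / (levels - 1)
    scaled = (np.clip(x, lo, hi) - lo) / step
    floor = np.floor(scaled)
    frac = scaled - floor
    up = rng.random(x.shape) < frac      # round up with probability = fractional part
    return lo + (floor + up) * step
```

Because the rounding is unbiased in expectation, gradient updates computed on quantized data can still converge to a model of comparable quality, which is the property such low-precision SGD designs rely on.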

70 citations

Proceedings ArticleDOI
27 Feb 2011
TL;DR: This work presents a high-throughput floating-point FPGA implementation that exploits the parallelism inherent in interior-point optimization methods and shows that by considering that the QPs come from a control formulation, it is possible to make heavy use of the sparsity in the problem to save computations and reduce memory requirements.
Abstract: Model predictive control (MPC) is an advanced industrial control technique that relies on the solution of a quadratic programming (QP) problem at every sampling instant to determine the input action required to control the current and future behaviour of a physical system. Its ability in handling large multiple input multiple output (MIMO) systems with physical constraints has led to very successful applications in slow processes, where there is sufficient time for solving the optimization problem between sampling instants. The application of MPC to faster systems, which adds the requirement of greater sampling frequencies, relies on new ways of finding faster solutions to QP problems. Field-programmable gate arrays (FPGAs) are specially well suited for this application due to the large amount of computation for a small amount of I/O. In addition, unlike a software implementation, an FPGA can provide the precise timing guarantees required for interfacing the controller to the physical system. We present a high-throughput floating-point FPGA implementation that exploits the parallelism inherent in interior-point optimization methods. It is shown that by considering that the QPs come from a control formulation, it is possible to make heavy use of the sparsity in the problem to save computations and reduce memory requirements by 75%. The implementation yields a 6.5x improvement in latency and a 51x improvement in throughput for large problems over a software implementation running on a general purpose microprocessor.
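Each interior-point iteration ultimately reduces to a structured linear solve; a minimal equality-constrained QP solved via its KKT system sketches that core kernel (this is a generic textbook formulation, not the paper's sparse MPC-specific solver):

```python
import numpy as np

def eq_qp(H, g, A, b):
    """Solve min 0.5 x^T H x + g^T x  s.t.  A x = b  via the KKT linear system.
    This linear solve is the dominant computation inside each interior-point step."""
    n, m = H.shape[0], A.shape[0]
    # KKT matrix: stationarity (H x + A^T lam = -g) and feasibility (A x = b)
    KKT = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, b])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:n], sol[n:]   # primal solution x, equality multipliers lam
```

In an MPC formulation the KKT matrix is banded and highly sparse, which is what the paper exploits to cut computation and memory by 75%.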

53 citations

References
Book
01 Jan 1983

34,729 citations

Journal ArticleDOI
TL;DR: An iterative algorithm is given for solving a system Ax=k of n linear equations in n unknowns and it is shown that this method is a special case of a very general method which also includes Gaussian elimination.
Abstract: An iterative algorithm is given for solving a system Ax=k of n linear equations in n unknowns. The solution is given in n steps. It is shown that this method is a special case of a very general method which also includes Gaussian elimination. These general algorithms are essentially algorithms for finding an n dimensional ellipsoid. Connections are made with the theory of orthogonal polynomials and continued fractions.

7,598 citations

01 Mar 1994
TL;DR: The Conjugate Gradient Method as discussed by the authors is the most prominent iterative method for solving sparse systems of linear equations and is a composite of simple, elegant ideas that almost anyone can understand.
Abstract: The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written so that even their own authors would be mystified, if they bothered to read their own writing. For this reason, an understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas that almost anyone can understand. Of course, a reader as intelligent as yourself will learn them almost effortlessly. The idea of quadratic forms is introduced and used to derive the methods of Steepest Descent, Conjugate Directions, and Conjugate Gradients. Eigenvectors are explained and used to examine the convergence of the Jacobi Method, Steepest Descent, and Conjugate Gradients. Other topics include preconditioning and the nonlinear Conjugate Gradient Method. I have taken pains to make this article easy to read. Sixty-two illustrations are provided. Dense prose is avoided. Concepts are explained in several different ways. Most equations are coupled with an intuitive interpretation.
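Steepest Descent, which the monograph uses as the stepping stone toward Conjugate Gradients, fits in a few lines (a generic sketch of the method, not code from the monograph):

```python
import numpy as np

def steepest_descent(A, b, tol=1e-10, max_iter=10000):
    """Minimize the quadratic form f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b,
    by repeatedly stepping along the residual r = b - A x (the negative gradient)."""
    x = np.zeros(len(b))
    for _ in range(max_iter):
        r = b - A @ x                      # residual = direction of steepest descent
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ A @ r)      # exact line search along r
        x += alpha * r
    return x
```

Unlike CG, successive steps here can repeat directions, so convergence degrades as the condition number of A grows; making the directions mutually A-conjugate is exactly the improvement CG introduces.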

2,535 citations


Additional excerpts

  • ...Conjugate Gradient algorithm iterates until the residual error is sufficiently small [Shewchuk 2003]....


Proceedings ArticleDOI
07 Nov 1998
TL;DR: An approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units using the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS).
Abstract: This paper describes an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time-consuming process. The work described here can help in automating much of this process. We will concentrate our efforts on the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS). In particular, the work presented here is for general matrix multiply, DGEMM. However, much of the technology and approach developed here can be applied to the other Level 3 BLAS, and the general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations.
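The cache-blocking loop structure that such autotuning generators search over can be sketched as follows (the block size below is an arbitrary placeholder; real tuners like this one find it empirically per machine, and generate low-level C or assembly rather than NumPy):

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Cache-blocked general matrix multiply: C = A @ B computed tile by tile,
    so each small update works on submatrices that fit in cache."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # accumulate one tile-by-tile product into the C tile
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C
```

The result is identical to an unblocked multiply; only the memory-access order changes, which is where the performance on deep memory hierarchies comes from.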

1,115 citations

Book
01 Jan 2007
TL;DR: In this article, a detailed introduction to the analysis and design of multiple-input multiple-output (MIMO) wireless systems is presented, and the fundamental capacity limits of MIMO systems are examined.
Abstract: Multiple-input multiple-output (MIMO) technology constitutes a breakthrough in the design of wireless communications systems, and is already at the core of several wireless standards. Exploiting multipath scattering, MIMO techniques deliver significant performance enhancements in terms of data transmission rate and interference reduction. This book is a detailed introduction to the analysis and design of MIMO wireless systems. Beginning with an overview of MIMO technology, the authors then examine the fundamental capacity limits of MIMO systems. Transmitter design, including precoding and space-time coding, is then treated in depth, and the book closes with two chapters devoted to receiver design. Written by a team of leading experts, the book blends theoretical analysis with physical insights, and highlights a range of key design challenges. It can be used as a textbook for advanced courses on wireless communications, and will also appeal to researchers and practitioners working on MIMO wireless systems.

721 citations