Proceedings Article

Efficient QR Decomposition Using Low Complexity Column-wise Givens Rotation (CGR)

TLDR
This paper proposes a novel Givens Rotation (GR) based QRD (GR-QRD) that reduces the computational complexity of GR. The algorithm is implemented on REDEFINE, a Coarse-Grained run-time Reconfigurable Architecture (CGRA).
Abstract
QR decomposition (QRD) is a widely used Numerical Linear Algebra (NLA) kernel with applications ranging from SONAR beamforming to wireless MIMO receivers. In this paper, we propose a novel Givens Rotation (GR) based QRD (GR-QRD) in which we reduce the computational complexity of GR and exploit a higher degree of parallelism. This low-complexity Column-wise GR (CGR) can annihilate multiple elements of a matrix column simultaneously. The algorithm is first realized on a Two-Dimensional (2D) systolic array and then implemented on REDEFINE, a Coarse-Grained run-time Reconfigurable Architecture (CGRA). We benchmark the proposed implementation against state-of-the-art implementations and report better throughput, convergence and scalability.
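For orientation, here is a minimal sketch of the classical, element-by-element Givens-rotation QRD that the abstract takes as its baseline. It is the textbook method, not the paper's CGR: each rotation annihilates one sub-diagonal element, which is the serial dependency CGR is described as removing. The function name `givens_qr` and the list-of-lists matrix layout are assumptions made for this illustration.

```python
import math


def givens_qr(a):
    """Classical Givens-rotation QR (textbook baseline, not the paper's CGR).

    Takes an m x n matrix as a list of rows and returns (q, r) such that
    a = q * r, with q orthogonal (m x m) and r upper triangular (m x n).
    """
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]  # working copy, becomes R
    q = [[float(i == j) for j in range(m)] for i in range(m)]  # accumulates Q

    for col in range(n):
        # Annihilate the sub-diagonal entries of this column one at a time,
        # from the bottom row upwards.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            h = math.hypot(x, y)
            if h == 0.0:
                continue  # both entries are already zero
            c, s = x / h, y / h  # cosine and sine of the rotation angle

            # Rotate rows (row - 1, row) of R: R <- G * R.
            for k in range(n):
                u, v = r[row - 1][k], r[row][k]
                r[row - 1][k] = c * u + s * v
                r[row][k] = -s * u + c * v

            # Accumulate the transpose on the right of Q: Q <- Q * G^T.
            for k in range(m):
                u, v = q[k][row - 1], q[k][row]
                q[k][row - 1] = c * u + s * v
                q[k][row] = -s * u + c * v
    return q, r
```

Calling `givens_qr([[12, -51, 4], [6, 167, -68], [-4, 24, -41]])` returns an orthogonal `q` and an upper-triangular `r` whose product reproduces the input up to floating-point rounding. Each inner-loop step depends on the result of the previous rotation in the same column, so the baseline exposes little parallelism within a column.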


Citations
Proceedings ArticleDOI

Micro-architectural Enhancements in Distributed Memory CGRAs for LU and QR Factorizations

TL;DR: This paper carries out extensive micro-architectural exploration for accelerating core kernels like Matrix Multiplication (MM) (BLAS-3) for LU and QR factorizations and achieves up to 8x speed-up for MM in a CGRA environment.
Proceedings ArticleDOI

Scalable and energy-efficient reconfigurable accelerator for column-wise Givens rotation

TL;DR: A new layered reconfigurable architecture is proposed that exploits modularity, scalability and flexibility to achieve high energy efficiency and memory bandwidth. It offers a clean trade-off of execution speed versus area while keeping energy relatively constant.
Proceedings ArticleDOI

Efficient and scalable CGRA-based implementation of Column-wise Givens Rotation

TL;DR: These algorithms allow annihilation of multiple elements in a column of the input matrix simultaneously, without a dependency bottleneck, allowing increased parallelism, resource sharing and scalability.

An Efficient Speech Perceptual Hashing Authentication Algorithm Based on Wavelet Packet Decomposition

TL;DR: The experimental results illustrate that the proposed algorithm is very robust to content-preserving operations, has a very low hash bit rate, and can meet the requirements of real-time speech authentication with high certification efficiency.
Journal ArticleDOI

Efficient Realization of Householder Transform through Algorithm-Architecture Co-design for Acceleration of QR Factorization

TL;DR: Efficient realization of Householder Transform (HT) based QR factorization through algorithm-architecture co-design is presented, where a performance improvement of 3-90x in terms of Gflops/watt over state-of-the-art multicore processors, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and ClearSpeed CSX700 is achieved.
References
Proceedings ArticleDOI

Matrix Triangularization By Systolic Arrays

TL;DR: In this paper, a unified concept of using systolic arrays to perform real-time triangularization for both general and band matrices is presented, and a framework is presented for the solution of linear systems with pivoting and for least squares computations.
Journal ArticleDOI

On Stable Parallel Linear System Solvers

TL;DR: Three stable parallel algorithms for solving dense and tridiagonal systems of linear equations are discussed, and one of the algorithms presented here is superior to the best previous algorithm in that with a modest increase in time.