
Showing papers on "QR decomposition published in 2009"


Posted Content
TL;DR: In this article, a modular framework for constructing randomized algorithms that compute partial matrix decompositions is presented: random sampling identifies a subspace that captures most of the action of a matrix, the input matrix is compressed to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization.
Abstract: Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed---either explicitly or implicitly---to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis.
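The randomized scheme described in the abstract is straightforward to prototype. The following NumPy sketch is an illustration of the general approach, not the authors' reference code; the target rank k and oversampling p are arbitrary choices. It builds a sampled basis with a Gaussian test matrix, compresses the input to that basis, and finishes with a deterministic SVD of the small compressed matrix.

import numpy as np

def randomized_low_rank(A, k, p=10, seed=0):
    # Stage 1: randomized range finder: sample the range of A and orthonormalize.
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + p))   # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)                     # basis capturing most of A's action
    # Stage 2: deterministic post-processing on the compressed matrix.
    B = Q.T @ A                                        # explicit compression to the subspace
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k, :]             # rank-k factors of A

# quick check on a synthetic low-rank matrix
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))
U, s, Vt = randomized_low_rank(A, k=20)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))   # relative error, near machine precision here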

2,356 citations


Journal ArticleDOI
TL;DR: This work generalizes a lower bound on the amount of communication needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values.
Abstract: In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
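As a quick worked example of the form of the bound (an illustrative calculation, not taken from the paper), the snippet below evaluates #flops/√M for conventional n-by-n matrix multiplication at a few assumed fast-memory sizes; the actual bound is this quantity times a modest constant.

import math

n = 4096
flops = 2 * n ** 3                                    # arithmetic operations of the conventional algorithm
for M in (32 * 1024, 256 * 1024, 8 * 1024 ** 2):      # fast-memory sizes in words (illustrative values)
    bound = flops / math.sqrt(M)                      # Omega(#arithmetic operations / sqrt(M))
    print(f"M = {M:>9d} words -> at least ~{bound:.2e} words moved (up to a constant factor)")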

257 citations


Journal ArticleDOI
TL;DR: Numerical tests are presented for the interval and the square, which show that approximate Fekete points are well suited for polynomial interpolation and cubature.
Abstract: We propose a numerical method (implemented in Matlab) for computing approximate Fekete points on compact multivariate domains. It relies on the search for maximum-volume submatrices of Vandermonde matrices computed on suitable discretization meshes, and uses a simple greedy algorithm based on QR factorization with column pivoting. The method also automatically yields an algebraic cubature formula, provided that the moments of the underlying polynomial basis are known. Numerical tests are presented for the interval and the square, which show that approximate Fekete points are well suited for polynomial interpolation and cubature.
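A minimal Python sketch of the greedy selection step on the interval is given below; it assumes a Chebyshev-Vandermonde basis and a fine candidate mesh (both arbitrary illustrative choices) and uses SciPy's QR with column pivoting on the transposed Vandermonde matrix to pick the approximate Fekete points. Computing the cubature weights from the basis moments is not shown.

import numpy as np
from scipy.linalg import qr

deg = 10
mesh = np.cos(np.pi * np.arange(2000) / 1999)              # fine candidate mesh on [-1, 1]
V = np.polynomial.chebyshev.chebvander(mesh, deg)          # rows of V correspond to mesh points
# greedy maximum-volume row selection via QR with column pivoting applied to V^T
_, _, piv = qr(V.T, pivoting=True, mode='economic')
fekete_pts = np.sort(mesh[piv[:deg + 1]])                  # the deg+1 approximate Fekete points
print(fekete_pts)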

101 citations


Journal ArticleDOI
TL;DR: This paper presents architectures and field-programmable gate-array designs of two variants of the DCD algorithm, known as cyclic and leading DCD algorithms, and proposes fixed-point designs that provide an accuracy performance that is very close to the performance of floating-point counterparts and require significantly lower FPGA resources than techniques based on QR decomposition.
Abstract: In the areas of signal processing and communications, such as antenna-array beamforming, adaptive filtering, multiuser and multiple-input-multiple-output (MIMO) detection, channel estimation and equalization, echo and interference cancellation, and others, solving linear systems of equations often provides an optimal performance. However, this is also a very complicated operation that designers try to avoid by proposing different suboptimal techniques. The dichotomous coordinate descent (DCD) algorithm allows linear systems of equations to be solved with high computational efficiency. In this paper, we present architectures and field-programmable gate-array (FPGA) designs of two variants of the DCD algorithm, which are known as cyclic and leading DCD algorithms. For each of these techniques, we present serial designs, group-2 and group-4 designs, as well as a design with parallel update of the residual vector for the cyclic DCD algorithm. These designs have different degrees of parallelism, thus enabling a tradeoff between FPGA resources and computation time. The serial designs require the smallest FPGA resources; they are well suited for applications where many parallel solvers are required, e.g., for detection in MIMO-orthogonal-frequency-division-multiplexing communication systems. The parallelism introduced in the proposed group-2 and group-4 designs allows faster convergence to the true solution at the expense of an increase in FPGA resources. The design with parallel update of the residual vector provides the fastest convergence speed; however, if the system size is high, it may result in a significant increase in FPGA resources. The proposed fixed-point designs provide an accuracy performance that is very close to the performance of floating-point counterparts and require significantly lower FPGA resources than techniques based on QR decomposition.
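For readers unfamiliar with DCD, the rough NumPy sketch below reconstructs the idea of the cyclic variant for a symmetric positive-definite system Rx = b (a software illustration with arbitrary parameter choices, not the paper's fixed-point FPGA designs): the step size is halved bit by bit, and each coordinate update only adds or subtracts a scaled column of R from the residual.

import numpy as np

def cyclic_dcd(R, b, n_bits=15, amplitude=2.0, max_updates=10000):
    n = len(b)
    x = np.zeros(n)
    r = b.copy()                      # residual vector, updated incrementally
    d = amplitude                     # current step size, halved once per bit
    updates = 0
    for _ in range(n_bits):
        d /= 2.0
        improved = True
        while improved and updates < max_updates:
            improved = False
            for i in range(n):        # cyclic pass over the coordinates
                if abs(r[i]) > 0.5 * d * R[i, i]:
                    s = 1.0 if r[i] > 0 else -1.0
                    x[i] += s * d                 # shift-and-add style coordinate update
                    r -= s * d * R[:, i]          # cheap residual update, no matrix solve
                    improved = True
                    updates += 1
    return x

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8)); R = G @ G.T + 8 * np.eye(8)
b = rng.standard_normal(8)
print(np.linalg.norm(cyclic_dcd(R, b) - np.linalg.solve(R, b)))   # small, limited by the final step size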

96 citations


Journal ArticleDOI
TL;DR: By an extension of the implicit Q theorem, the palindromic QR algorithm is shown to be equivalent to a previously developed explicit version and the classical convergence theory for the QR algorithm can be extended to prove local quadratic convergence.
Abstract: In the spirit of the Hamiltonian QR algorithm and other bidirectional chasing algorithms, a structure-preserving variant of the implicit QR algorithm for palindromic eigenvalue problems is proposed. This new palindromic QR algorithm is strongly backward stable and requires fewer operations than the standard QZ algorithm, but is restricted to matrix classes where a preliminary reduction to structured Hessenberg form can be performed. By an extension of the implicit Q theorem, the palindromic QR algorithm is shown to be equivalent to a previously developed explicit version. Also, the classical convergence theory for the QR algorithm can be extended to prove local quadratic convergence. We briefly demonstrate how even eigenvalue problems can be addressed by similar techniques.

90 citations


Journal ArticleDOI
TL;DR: A generalization of the Gröbner basis method for polynomial equation solving is derived that improves overall numerical stability, and it is shown how the action matrix can be computed in the general setting of an arbitrary linear basis for ℂ[x]/I.
Abstract: This paper presents several new results on techniques for solving systems of polynomial equations in computer vision. Gröbner basis techniques for equation solving have been applied successfully to several geometric computer vision problems. However, in many cases these methods are plagued by numerical problems. In this paper we derive a generalization of the Gröbner basis method for polynomial equation solving, which improves overall numerical stability. We show how the action matrix can be computed in the general setting of an arbitrary linear basis for ℂ[x]/I. In particular, two improvements on the stability of the computations are made by studying how the linear basis for ℂ[x]/I should be selected. The first of these strategies utilizes QR factorization with column pivoting and the second is based on singular value decomposition (SVD). Moreover, it is shown how to improve stability further by an adaptive scheme for truncation of the Gröbner basis. These new techniques are studied on some of the latest reported uses of Gröbner basis methods in computer vision and we demonstrate dramatically improved numerical stability making it possible to solve a larger class of problems than previously possible.

88 citations


Proceedings ArticleDOI
08 Mar 2009
TL;DR: This paper discusses the architectural characteristics of GPUs, explains how a high-performance QR decomposition may be implemented, provides detailed performance analysis of the resulting implementation for real-valued matrices, and offers recommendations for achieving high performance to future developers of dense linear algebra procedures for GPUs.
Abstract: QR decomposition is a computationally intensive linear algebra operation that factors a matrix A into the product of a unitary matrix Q and upper triangular matrix R. Adaptive systems commonly employ QR decomposition to solve overdetermined least squares problems. Performance of QR decomposition is typically the crucial factor limiting problem sizes. Graphics Processing Units (GPUs) are high-performance processors capable of executing hundreds of floating point operations in parallel. As commodity accelerators for 3D graphics, GPUs offer tremendous computational performance at relatively low costs. While GPUs are favorable to applications with much inherent parallelism requiring coarse-grain synchronization between processors, methods for efficiently utilizing GPUs for algorithms computing QR decomposition remain elusive. In this paper, we discuss the architectural characteristics of GPUs and explain how a high-performance implementation of QR decomposition may be achieved. We provide detailed performance analysis of the resulting implementation for real-valued matrices and offer recommendations for achieving high performance to future developers of dense linear algebra procedures for GPUs. Our implementation sustains 143 GFLOP/s, and we believe this is the fastest announced QR implementation executing entirely on the GPU.
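The paper's GPU-specific blocking and scheduling are not reproduced here, but the underlying factorization is standard Householder QR; the NumPy sketch below shows that baseline algorithm (unblocked, real-valued), which GPU implementations reorganize into panel factorizations and large trailing-matrix updates.

import numpy as np

def householder_qr(A):
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = A[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])      # reflector that zeroes column k below the diagonal
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue
        v /= nv
        A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])      # apply the reflector to the trailing matrix
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)        # accumulate Q explicitly (optional)
    return Q, np.triu(A)

A = np.random.default_rng(0).standard_normal((6, 4))
Q, R = householder_qr(A)
print(np.linalg.norm(A - Q @ R), np.linalg.norm(Q.T @ Q - np.eye(6)))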

65 citations


Book
01 Jan 2009
TL;DR: An edited volume on QRD-RLS adaptive filtering, covering an annotated bibliography of QR decomposition, an introduction to adaptive filters, conventional, inverse, fast, lattice, multichannel, and Householder-based QRD-RLS algorithms, their finite- and infinite-precision numerical properties, pipelined implementations, weight extraction, and linearly constrained variants.
Abstract: QR Decomposition: An Annotated Bibliography; Introduction to Adaptive Filters; Conventional and Inverse QRD-RLS Algorithms; Fast QRD-RLS Algorithms; QRD Least-Squares Lattice Algorithms; Multichannel Fast QRD-RLS Algorithms; Householder-Based RLS Algorithms; Numerical Stability Properties; Finite and Infinite-Precision Properties of QRD-RLS Algorithms; On Pipelined Implementations of QRD-RLS Adaptive Filters; Weight Extraction of Fast QRD-RLS Algorithms; Linear Constrained QRD-Based Algorithm.

63 citations


Patent
09 Feb 2009
TL;DR: In this article, a MIMO receiver is provided with a preprocessor for performing QR decomposition of a channel matrix H wherein the factored reduced matrix R is used in place of H and Q*y is used to replace the received vector y in a maximum likelihood detector (MLD).
Abstract: A MIMO receiver is provided with a preprocessor for performing QR decomposition of a channel matrix H wherein the factored reduced matrix R is used in place of H and Q*y is used in place of the received vector y in a maximum likelihood detector (“MLD”). The maximum likelihood detector might be a hard-decision MLD or a soft-decision MLD. A savings of computational complexity can be used to provide comparable results more quickly, using less circuitry, and/or requiring less consumed energy, or performance can be improved for a fixed amount of time, circuitry and/or energy. Where the MLD uses approximations, such as finite resolution calculations (fixed point or the like) or L1 Norm approximations, the reduced number of operations resulting from using the reduced matrix results in improved approximations as a result of the finite resolution operations. Other methods of reducing the channel matrix might be used for suitable and/or cumulative advantages.
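The complexity saving rests on the identity ||y − Hs||² = ||Q*y − Rs||² for a square channel matrix H = QR with unitary Q; the small NumPy check below (an illustration of the identity, not the patented receiver) evaluates both metrics for one candidate symbol vector.

import numpy as np

rng = np.random.default_rng(3)
nt = 4
alphabet = np.array([-3, -1, 1, 3])
H = (rng.standard_normal((nt, nt)) + 1j * rng.standard_normal((nt, nt))) / np.sqrt(2)
s = rng.choice(alphabet, nt) + 1j * rng.choice(alphabet, nt)          # 16-QAM-like transmit vector
y = H @ s + 0.1 * (rng.standard_normal(nt) + 1j * rng.standard_normal(nt))

Q, R = np.linalg.qr(H)
z = Q.conj().T @ y                                                     # "Q*y" used in place of y
cand = rng.choice(alphabet, nt) + 1j * rng.choice(alphabet, nt)        # some hypothesis tested by the MLD
m_full = np.linalg.norm(y - H @ cand) ** 2                             # metric with the full channel matrix
m_tri = np.linalg.norm(z - R @ cand) ** 2                              # metric with the triangular matrix only
print(m_full, m_tri)                                                   # equal up to rounding, since Q is unitary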

62 citations


Journal ArticleDOI
TL;DR: Results indicate that the SOCA algorithm has an attractive performance-complexity profile for both fast and slow fading 4 × 4 and 8 × 8 MIMO channels with quadrature amplitude modulation (QAM) inputs.
Abstract: We present a soft-output multiple-input multiple-output (MIMO) detection algorithm that achieves near max-log optimal error rate performance with low- and fixed-computational complexity. The proposed smart ordering and candidate adding (SOCA) algorithm combines a smart-ordered QR decomposition with smart candidate adding and a parallel layer-by-layer search of the detection tree. In contrast to prior algorithms that use smart candidate adding, the proposed algorithm has fixed computational complexity, and it never visits a node more than once. Results indicate that the SOCA algorithm has an attractive performance-complexity profile for both fast and slow fading 4 × 4 and 8 × 8 MIMO channels with quadrature amplitude modulation (QAM) inputs.

55 citations


Proceedings ArticleDOI
24 May 2009
TL;DR: A hybrid QRD scheme is proposed that uses a combination of multi-dimensional Givens rotations, Householder transformations and the conventional two-dimensional (2D) Givens rotations to both reduce the overall computational complexity and achieve higher execution parallelism.
Abstract: QR decomposition (QRD) is an essential signal processing task for many MIMO signal detection schemes. However, decomposition of complex MIMO channel matrices with large dimensions leads to high computational complexity, and hence results in either large core area or low throughput. Moreover, for mobile communication applications that involve fast-varying channels, it is required to perform QR decomposition with low processing latency. In this paper, we propose a hybrid QRD scheme that uses a combination of multi-dimensional Givens rotations, Householder transformations and the conventional two-dimensional (2D) Givens rotations to both reduce the overall computational complexity and achieve higher execution parallelism. To prove the effectiveness of the proposed QRD scheme, a novel pipelined architecture is presented that uses un-rolled pipelined CORDIC processors iteratively to maximize throughput and resource utilization, while minimizing the gate count. The architectures of the main data processing modules, namely the 2D, Householder 3D and 4D/2D configurable pipelined CORDIC processors, are also presented. Synthesis results for a 4×4 MIMO detector in 0.13µm CMOS process indicate that this QRD design computes a 4×4 complex R matrix and four updated 4×1 complex symbol vectors every 40 cycles, at a clock frequency of 270 MHz and requires 36K gates. The proposed design achieves the lowest processing time and the highest throughput reported to-date for the same framework.
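For reference, the conventional 2D Givens rotation QR that the hybrid scheme builds on can be written in a few lines of NumPy (a real-valued software sketch; the paper's CORDIC-based multi-dimensional rotations and pipelining are not modeled here).

import numpy as np

def givens_qr(A):
    R = A.astype(float).copy()
    m, n = R.shape
    Q = np.eye(m)
    for j in range(n):
        for i in range(j + 1, m):                      # annihilate R[i, j] by rotating rows j and i
            a, b = R[j, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])            # 2D Givens rotation
            R[[j, i], j:] = G @ R[[j, i], j:]
            Q[:, [j, i]] = Q[:, [j, i]] @ G.T          # accumulate Q
    return Q, R

A = np.random.default_rng(1).standard_normal((5, 4))
Q, R = givens_qr(A)
print(np.linalg.norm(A - Q @ R), np.abs(np.tril(R, -1)).max())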

Proceedings ArticleDOI
24 May 2009
TL;DR: An iterative QR decomposition (IQRD) architecture based on the modified Gram-Schmidt (MGS) algorithm is proposed; the hardware is constructed from a diagonal process and a triangular process and requires fewer gate counts and lower power consumption than a full triangular systolic array.
Abstract: Implementation of an iterative QR decomposition (QRD) architecture based on the modified Gram-Schmidt (MGS) algorithm is proposed in this paper. In order to achieve computational efficiency with robust numerical stability, a triangular systolic array (TSA) for QRD of large-size matrices is presented. The TSA architecture can then be modified into an iterative architecture, called iterative QRD (IQRD), to reduce hardware cost. The IQRD hardware is constructed from a diagonal process (DP) and a triangular process (TP) with fewer gate counts and lower power consumption than TSA-based QRD (TSAQRD). For a 4×4 matrix, the hardware area of the proposed IQRD is about 76% smaller, in gate count, than that of TSAQRD. For a generic square matrix of order n, the latency required by the MGS-based IQRD is 2n−1 time units; thus, the total clock latency is only n(2n+3) cycles.
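The diagonal/triangular split used by the IQRD maps directly onto the structure of MGS itself: diagonal cells compute column norms and triangular cells apply the projections. A plain NumPy sketch of the MGS factorization (the algorithm only, not the systolic or iterative hardware) is shown below.

import numpy as np

def mgs_qr(A):
    W = A.astype(float).copy()
    m, n = W.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(W[:, k])      # "diagonal process": normalization
        Q[:, k] = W[:, k] / R[k, k]
        for j in range(k + 1, n):              # "triangular process": orthogonalize remaining columns
            R[k, j] = Q[:, k] @ W[:, j]
            W[:, j] -= R[k, j] * Q[:, k]
    return Q, R

A = np.random.default_rng(2).standard_normal((6, 4))
Q, R = mgs_qr(A)
print(np.linalg.norm(A - Q @ R), np.linalg.norm(Q.T @ Q - np.eye(4)))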

Journal ArticleDOI
TL;DR: A simple and fast computational algorithm for sensor placement is presented, based on the relationship between the Effective Independence and Modal Kinetic Energy methods and on a downdating algorithm for the QR decomposition of a reduced modal matrix.

Journal ArticleDOI
TL;DR: The linear system refinement algorithm is applied to Björck's augmented linear system formulation of an LLS problem; it will be included in a future release of LAPACK and can be extended to other types of least squares problems.
Abstract: We present the algorithm, error bounds, and numerical results for extra-precise iterative refinement applied to overdetermined linear least squares (LLS) problems. We apply our linear system refinement algorithm to Björck's augmented linear system formulation of an LLS problem. Our algorithm reduces the forward normwise and componentwise errors to O(ε_w), where ε_w is the working precision, unless the system is too ill conditioned. In contrast to linear systems, we provide two separate error bounds for the solution x and the residual r. The refinement algorithm requires only limited use of extra precision and adds only O(mn) work to the O(mn^2) cost of QR factorization for problems of size m-by-n. The extra precision calculation is facilitated by the new extended-precision BLAS standard in a portable way, and the refinement algorithm will be included in a future release of LAPACK and can be extended to the other types of least squares problems.
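A much-simplified sketch of one ingredient, refinement sweeps on Björck's augmented system that reuse the existing QR factors, is given below in NumPy, with float32 standing in for the working precision and float64 for the extra precision (the LAPACK routine's error bounds, stopping tests, and extended-precision BLAS are not reproduced).

import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 8
A = rng.standard_normal((m, n)).astype(np.float32)
b = rng.standard_normal(m).astype(np.float32)
Ad, bd = A.astype(np.float64), b.astype(np.float64)    # same data, used only for extra-precision residuals

Q, R = np.linalg.qr(A)                                  # working-precision QR factorization
x = np.linalg.solve(R, Q.T @ b)
r = b - A @ x                                           # augmented-system unknowns are (r, x)

for _ in range(3):                                      # a few refinement sweeps
    f = (bd - r - Ad @ x).astype(np.float32)            # residuals of the augmented system, extra precision
    g = (-Ad.T @ r).astype(np.float32)
    h = np.linalg.solve(R.T, g)                         # corrections solve: dr + A dx = f, A^T dr = g
    dx = np.linalg.solve(R, Q.T @ f - h)
    dr = f - A @ dx
    x, r = x + dx, r + dr

print(np.linalg.norm(x - np.linalg.lstsq(Ad, bd, rcond=None)[0]))   # error after refinement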

Posted Content
TL;DR: In this paper, interpolation-based QR decomposition algorithms were proposed for MIMO-OFDM systems with high number of data-carrying tones and small channel order.
Abstract: Detection algorithms for multiple-input multiple-output (MIMO) wireless systems based on orthogonal frequency-division multiplexing (OFDM) typically require the computation of a QR decomposition for each of the data-carrying OFDM tones. The resulting computational complexity will, in general, be significant, as the number of data-carrying tones ranges from 48 (as in the IEEE 802.11a/g standards) to 1728 (as in the IEEE 802.16e standard). Motivated by the fact that the channel matrices arising in MIMO-OFDM systems are highly oversampled polynomial matrices, we formulate interpolation-based QR decomposition algorithms. An in-depth complexity analysis, based on a metric relevant for very large scale integration (VLSI) implementations, shows that the proposed algorithms, for a sufficiently high number of data-carrying tones and a sufficiently small channel order, provably exhibit significantly smaller complexity than brute-force per-tone QR decomposition.

Journal ArticleDOI
TL;DR: This paper presents a low-complexity generalized sphere decoding (GSD) approach by transforming the original underdetermined problem into the full-column-rank one so that standard SD can be directly applied on the transformed problem.
Abstract: For underdetermined linear systems, original sphere decoding (SD) algorithms fail due to zero diagonal elements in the upper-triangular matrix of the QR or Cholesky factorization of the underdetermined matrix. To solve this problem, this paper presents a low-complexity generalized sphere decoding (GSD) approach by transforming the original underdetermined problem into a full-column-rank one so that standard SD can be directly applied to the transformed problem. Since the introduced transformation maintains the dimension of the original problem for all M-QAM constellations, the proposed GSD approach provides significant reduction in complexity as compared to other GSD schemes, especially for M-QAM with large signaling constellations. Both performance and expected complexity are analyzed to provide comprehensive relationships between the performance and complexity of the proposed GSD and its parameters. Illustrative simulation and analytical results are in good agreement in terms of both performance and complexity and indicate that, with properly selected design parameters, the proposed GSD scheme can approach the optimum maximum-likelihood decoding (MLD) performance with low complexity for underdetermined linear communication systems, including underdetermined MIMO systems, and that the proposed expected complexity analysis can be used as a reliable complexity estimate for practical implementation of the proposed algorithm and serve as a reference for other GSD algorithms.

Journal ArticleDOI
01 Jan 2009
TL;DR: It is demonstrated how the potential of the Cell Broadband Engine can be utilized to the fullest by employing the new algorithmic approach and successfully exploiting the capabilities of the chip in terms of single-instruction-multiple-data parallelism, instruction-level parallelism, and thread-level parallelism.
Abstract: The QR factorization is one of the most important operations in dense linear algebra, offering a numerically stable method for solving linear systems of equations, including overdetermined and underdetermined systems. Modern implementations of the QR factorization, such as the one in the LAPACK library, suffer from performance limitations due to the use of matrix-vector type operations in the phase of panel factorization. These limitations can be remedied by using the idea of updating the QR factorization, rendering an algorithm which is much more scalable and much more suitable for implementation on a multi-core processor. It is demonstrated how the potential of the Cell Broadband Engine can be utilized to the fullest by employing the new algorithmic approach and successfully exploiting the capabilities of the chip in terms of single-instruction-multiple-data parallelism, instruction-level parallelism, and thread-level parallelism.

Proceedings ArticleDOI
14 Jun 2009
TL;DR: An FPGA architecture of a MIMO decoder based on the fixed sphere decoder (FSD) algorithm that achieves close-to-ML BER performance with reduced computational complexity and fixed throughput for the IEEE 802.16e WiMAX mobile systems.
Abstract: In this paper, we present an FPGA prototype of the MIMO decoder for the IEEE 802.16e WiMAX mobile systems. The IEEE 802.16e standard supports three types of MIMO space time codes (STC), referred to in the standard as matrix A, B, and C, that achieve different levels of throughput and diversity depending on the quality of the MIMO channels. In particular, the STC matrix A achieves full diversity by employing Alamouti coding, the STC matrix B achieves full rate by employing spatial multiplexing, and the STC matrix C achieves full rate and diversity by employing the Golden code. We present an FPGA architecture of a MIMO decoder based on the fixed sphere decoder (FSD) algorithm that achieves close-to-ML BER performance with reduced computational complexity and fixed throughput. We show how a single FSD can be used to decode the different STC by adaptively processing the received signal according to the STC type before it is fed to the FSD. The FPGA design incorporates a QR decomposition of the channel matrix. The proposed FSD achieves the fixed and high throughput required for WiMAX systems. The FPGA implementation is combined with a MATLAB simulation model of an FUSC OFDMA-based WiMAX 2x2 MIMO system to validate the hardware design.

01 Jan 2009
TL;DR: In this paper, the authors applied maximum likelihood detection employing QR decomposition and M-algorithm (QRM-MLD) to the SC signal detection with antenna diversity reception.
Abstract: The frequency-domain received single-carrier (SC) signal can be expressed using a matrix representation similar to that of multiple-input multiple-output (MIMO) multiplexing. The signal detection schemes developed for MIMO multiplexing can therefore be applied to SC transmissions. In this paper, we apply maximum likelihood detection employing QR decomposition and the M-algorithm (QRM-MLD) to SC signal detection with antenna diversity reception. We show that by using antenna diversity reception, the number of surviving symbol candidates can be reduced. We evaluate, by computer simulation, the bit error rate (BER) performance achievable by QRM-MLD and compare it with that achievable by Vertical Bell Laboratories Layered Space-Time (V-BLAST) detection.
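QRM-MLD itself is compact to express: after the QR decomposition, an M-algorithm tree search keeps only the M best partial symbol vectors at each stage. The NumPy sketch below illustrates the search for a small complex-valued system with QPSK symbols (illustrative parameters; the paper's single-carrier formulation and antenna-diversity reception are not modeled).

import numpy as np

def qrm_mld(y, H, constellation, M=4):
    Q, R = np.linalg.qr(H)
    z = Q.conj().T @ y
    n = H.shape[1]
    survivors = [((), 0.0)]                              # (symbols for layers k..n-1, accumulated metric)
    for k in range(n - 1, -1, -1):                       # detect from the last layer upward
        expanded = []
        for partial, metric in survivors:
            for s in constellation:                      # append every candidate symbol for layer k
                cand = (s,) + partial
                interf = sum(R[k, k + 1 + j] * cand[1 + j] for j in range(len(partial)))
                expanded.append((cand, metric + abs(z[k] - R[k, k] * s - interf) ** 2))
        survivors = sorted(expanded, key=lambda t: t[1])[:M]    # keep only the M best candidates
    return np.array(survivors[0][0])

rng = np.random.default_rng(0)
n = 4
H = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
s = qpsk[rng.integers(0, 4, n)]
y = H @ s + 0.05 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
print(qrm_mld(y, H, qpsk, M=4))
print(s)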

Journal ArticleDOI
TL;DR: A two-dimensional systolic array QR decomposition is implemented on a Xilinx Virtex5 FPGA using the Givens rotation algorithm, which uses straightforward floating-point divide and square root implementations, which makes it easier to be used within a larger system.
Abstract: We have implemented a two-dimensional systolic array QR decomposition on a Xilinx Virtex5 FPGA using the Givens rotation algorithm. QR decomposition is a key step in many DSP applications including sonar beamforming, channel equalization, and 3G wireless communication. Compared to previous work that implements Givens rotations using a one-dimensional systolic array, our implementation uses a truly two-dimensional systolic array architecture. As a result, latency scales well for larger matrices. In addition, prior work avoids divide and square root operations in the Givens rotation algorithm by using special operations such as CORDIC or special number systems such as the logarithmic number system (LNS). In contrast, our design uses straightforward floating-point divide and square root implementations, which makes it easier to use within a larger system. In our design, the input matrix size can be configured at compile time to many different sizes, making it easily scalable to future large FPGAs or over multiple FPGAs. The QR module is fully pipelined with a throughput of over 130 MHz for the IEEE single-precision floating-point format. The peak performance for a 12 × 12 input matrix is approximately 35 GFLOP/s.

Journal ArticleDOI
TL;DR: A dual-lattice view of the vertical Bell Labs Layered Space-Time (V-BLAST) detection is presented, and a partial reduction algorithm that only performs lattice reduction for the last several, weak substreams is proposed, offering a graceful tradeoff between performance and complexity for SIC-based MIMO detection.
Abstract: In this paper, we propose low-complexity lattice detection algorithms for successive interference cancelation (SIC) in multi-input multi-output (MIMO) communications. First, we present a dual-lattice view of the vertical Bell Labs Layered Space-Time (V-BLAST) detection. We show that V-BLAST ordering is equivalent to applying sorted QR decomposition to the dual basis, or equivalently, applying sorted Cholesky decomposition to the associated Gram matrix. This new view results in lower detection complexity and allows simultaneous ordering and detection. Second, we propose a partial reduction algorithm that only performs lattice reduction for the last several, weak substreams, whose implementation is also facilitated by the dual-lattice view. By tuning the block size of the partial reduction (hence the complexity), it can achieve a variable diversity order, hence offering a graceful tradeoff between performance and complexity for SIC-based MIMO detection. Numerical results are presented to compare the computational costs and to verify the achieved diversity order.
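The sorted QR decomposition mentioned above can be prototyped as modified Gram-Schmidt with a smallest-remaining-norm column pivot, followed by successive cancellation; the NumPy sketch below shows that baseline (an illustration of SQRD-ordered SIC only, not the paper's dual-lattice formulation or partial lattice reduction).

import numpy as np

def sorted_qrd(H):
    W = H.astype(complex).copy()
    m, n = W.shape
    Q = np.zeros((m, n), dtype=complex)
    R = np.zeros((n, n), dtype=complex)
    perm = np.arange(n)
    for k in range(n):
        j = k + int(np.argmin(np.linalg.norm(W[:, k:], axis=0)))   # weakest remaining column goes next
        W[:, [k, j]] = W[:, [j, k]]
        R[:, [k, j]] = R[:, [j, k]]
        perm[[k, j]] = perm[[j, k]]
        R[k, k] = np.linalg.norm(W[:, k])
        Q[:, k] = W[:, k] / R[k, k]
        for i in range(k + 1, n):
            R[k, i] = Q[:, k].conj() @ W[:, i]
            W[:, i] -= R[k, i] * Q[:, k]
    return Q, R, perm

def sic_detect(y, Q, R, perm, constellation):
    z = Q.conj().T @ y
    n = R.shape[0]
    s = np.zeros(n, dtype=complex)
    for k in range(n - 1, -1, -1):                       # cancel already-detected layers, then slice
        est = (z[k] - R[k, k + 1:] @ s[k + 1:]) / R[k, k]
        s[k] = constellation[np.argmin(np.abs(constellation - est))]
    out = np.zeros(n, dtype=complex)
    out[perm] = s                                        # undo the column permutation
    return out

rng = np.random.default_rng(1)
n = 4
H = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
s = qpsk[rng.integers(0, 4, n)]
y = H @ s + 0.05 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
Q, R, perm = sorted_qrd(H)
print(sic_detect(y, Q, R, perm, qpsk))
print(s)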

Journal ArticleDOI
TL;DR: It is shown that for any sequence of unit 2-norm n-vectors there is a special unitary matrix, called a unitary augmentation of these vectors, which can be used in the analyses without appealing to the MGS connection; the main theorem on orthogonalization is extended to cover the case of biorthogonalization.
Abstract: Charles Sheffield pointed out that the modified Gram-Schmidt (MGS) orthogonalization algorithm for the QR factorization of B ∈ ℝ^{n×k} is mathematically equivalent to the QR factorization applied to the matrix B augmented with a k×k matrix of zero elements on top. This is true in theory for any method of QR factorization, but for Householder's method it is true in the presence of rounding errors as well. This knowledge has been the basis for several successful but difficult rounding error analyses of algorithms which in theory produce orthogonal vectors but significantly fail to do so because of rounding errors. Here we show that the same results can be found more directly and easily without recourse to the MGS connection. It is shown that for any sequence of k unit 2-norm n-vectors there is a special (n+k)-square unitary matrix which we call a unitary augmentation of these vectors and that this matrix can be used in the analyses without appealing to the MGS connection. We describe the connection of this unitary matrix to Householder matrices. The new approach is applied to an earlier analysis to illustrate both the improvement in simplicity and advantages for future analyses. Some properties of this unitary matrix are derived. The main theorem on orthogonalization is then extended to cover the case of biorthogonalization.
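Sheffield's observation is easy to verify numerically. The short NumPy check below (an illustrative experiment, not from the paper) compares the triangular factor produced by Householder QR of B augmented with a k-by-k zero block on top against the factor produced by MGS applied to B directly, after normalizing the sign of each row.

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 6
B = rng.standard_normal((n, k))

# modified Gram-Schmidt applied to B
V = B.copy()
R_mgs = np.zeros((k, k))
for i in range(k):
    R_mgs[i, i] = np.linalg.norm(V[:, i])
    V[:, i] /= R_mgs[i, i]
    for j in range(i + 1, k):
        R_mgs[i, j] = V[:, i] @ V[:, j]
        V[:, j] -= R_mgs[i, j] * V[:, i]

# Householder QR (LAPACK via NumPy) applied to B with a k-by-k zero block stacked on top
B_aug = np.vstack([np.zeros((k, k)), B])
R_hh = np.linalg.qr(B_aug, mode='r')

R_hh = R_hh * np.sign(np.diag(R_hh))[:, None]      # remove the sign ambiguity before comparing
print(np.max(np.abs(R_hh - R_mgs)))                # agrees to roundoff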

Proceedings ArticleDOI
10 Nov 2009
TL;DR: A fast multilevel direct solver for electromagnetic scattering from quasi-planar objects is presented; approximately O(N^1.5) complexity is attained for the matrix compression and approximately O(N) for each right-hand-side solution, N being the number of unknowns.
Abstract: A fast multilevel direct solver for electromagnetic scattering from quasi-planar objects is presented. The solver relies on the compression of the off-diagonal blocks in the impedance matrix, which describe interactions between distinct domains. The compression is performed in a number of steps. First, the scatterer is decomposed into sub-domains using a multilevel quad-tree hierarchical subdivision. Then, the field radiated by each sub-domain, onto the rest of the scatterer, is determined using a non-uniform sampling grid approach. Subsequently, a rank-revealing QR decomposition is applied to the grid matrix to find the current basis functions that actually contribute to the radiated field. At the same time, the employed decomposition singles out the most important grid points (called the grid skeleton), from which the field on the observation domain can be reconstructed. Finally, compression of the local interacting currents and fields inside each sub-domain is performed using the Schur complement method. The non-interacting currents are solved locally, whereas interacting currents and grid skeletons are repeatedly aggregated with neighboring sections in a multilevel process. The resulting compressed system of equations is solved directly. The algorithm is analyzed for performance and stability. It is shown that approximately O(N^1.5) complexity is attained for the matrix compression and approximately O(N) for each right-hand-side solution, N being the number of unknowns. The analytical complexity estimates are supported by the results of a numerical case study.

Patent
Kyeong Jin Kim
30 Jan 2009
TL;DR: In this article, a MIMO channel frequency response matrix is decomposed into a frequency-related part and a constant part, where the constant part is independent of subcarrier index and of number of sub-carriers in one symbol interval.
Abstract: A MIMO channel frequency response matrix is decomposed into a frequency- related part and a constant part. The constant part is independent of subcarrier index and of number of subcarriers in one symbol interval. Separated QR decomposition and either SVD or GMD is applied to the two parts. A right unitary matrix (R) is obtained from the SVD or GMD applied to the constant part. QR decomposition is applied to the constant part to generate a beamforming matrix (V). In another embodiment, a selection criterion based on a correlation matrix distance is used to select a beamforming matrix that is independent of subcarrier, the selected matrix is retrieved from a local memory and applied to a received signal. Noise covariance is computed for a noise expression which considers interference generated from the applied beamforming matrix. Data detection is performed on the received signal by a MIMO data detector using the noise covariance.

Proceedings Article
01 Jan 2009
TL;DR: SuiteSparseQR is a sparse multifrontal QR factorization algorithm that obtains a substantial fraction of the theoretical peak performance of a multicore computer.
Abstract: SuiteSparseQR is a sparse multifrontal QR factorization algorithm. Dense matrix methods within each frontal matrix enable the method to obtain high performance on multicore architectures. Parallelism across different frontal matrices is handled with Intel's Threading Building Blocks library. Rank-detection is performed within each frontal matrix using Heath's method, which does not require column pivoting. The resulting sparse QR factorization obtains a substantial fraction of the theoretical peak performance of a multicore computer.

Journal ArticleDOI
TL;DR: The results indicate that the PSAI algorithm is at least comparable to, and can be much more effective than, the adaptive SPAI algorithm; it often outperforms the static SAI algorithms considerably and is more robust and practical than the static ones for general problems.
Abstract: Motivated by the Cayley–Hamilton theorem, a novel adaptive procedure, called a Power Sparse Approximate Inverse (PSAI) procedure, is proposed that uses a different adaptive sparsity pattern selection approach to constructing a right preconditioner M for the large sparse linear system Ax=b. It determines the sparsity pattern of M dynamically and computes the n independent columns of M so that M is optimal in the Frobenius norm minimization, subject to the sparsity pattern of M. The PSAI procedure needs a matrix–vector product at each step and updates the solution of a small least squares problem cheaply. To control the sparsity of M and develop a practical PSAI algorithm, two dropping strategies are proposed. The PSAI algorithm can capture an effective approximate sparsity pattern of A−1 and compute a good sparse approximate inverse M efficiently. Numerical experiments are reported to verify the effectiveness of the PSAI algorithm. Numerical comparisons are made for the PSAI algorithm and the adaptive SPAI algorithm proposed by Grote and Huckle as well as for the PSAI algorithm and three static Sparse Approximate Inverse (SAI) algorithms. The results indicate that the PSAI algorithm is at least comparable to and can be much more effective than the adaptive SPAI algorithm, and it often outperforms the static SAI algorithms very considerably and is more robust and practical than the static ones for general problems.

Proceedings ArticleDOI
07 Jun 2009
TL;DR: A unified hardware architecture for fast, area-efficient QR factorization based on the Householder transformation is presented, together with the design and implementation of the proposed hardware and synthesis results on FPGA hardware.
Abstract: The QR factorization is used in many signal processing and communication applications such as echo cancellation, adaptive beamforming and multiple-input multiple-output (MIMO) systems. However, the division, square root and inverse square root operations required by the QR algorithm are very difficult to implement because they are computationally slow and area-consuming arithmetic operations. This paper presents a unified hardware architecture for fast, area-efficient QR factorization based on the Householder transformation. Newton-Raphson and Goldschmidt algorithms are used for fast division, square root and inverse square root blocks. By using a unified architecture, area and power requirements for QR factorization are reduced without decreasing overall speed. The design and implementation of the proposed hardware is presented with synthesis results based on FPGA hardware.
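The division, square root, and inverse square root units referred to above are typically realized with multiply-add iterations; the short NumPy illustration below shows textbook Newton-Raphson recurrences for the reciprocal and the inverse square root (a generic numerical sketch with arbitrary initial guesses, not the paper's Goldschmidt-based hardware).

import numpy as np

def nr_reciprocal(a, y0=0.1, iters=5):
    y = y0
    for _ in range(iters):
        y = y * (2.0 - a * y)              # quadratically convergent, multiplies and adds only
    return y

def nr_rsqrt(a, y0=0.3, iters=5):
    y = y0
    for _ in range(iters):
        y = y * (1.5 - 0.5 * a * y * y)    # Newton-Raphson step for 1/sqrt(a)
    return y

a = 7.0
print(nr_reciprocal(a), 1.0 / a)
print(nr_rsqrt(a), 1.0 / np.sqrt(a))
print(a * nr_rsqrt(a), np.sqrt(a))         # sqrt(a) = a * rsqrt(a), as needed for Householder reflector norms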

Journal ArticleDOI
TL;DR: The collocation points are placed on concentric spheres and thus the resulting global matrix possesses a block circulant structure, which is exploited to develop an efficient matrix decomposition algorithm for the solution of the resulting system.
Abstract: In this study, we propose an efficient algorithm for the evaluation of the particular solutions of three-dimensional inhomogeneous elliptic partial differential equations using radial basis functions. The collocation points are placed on concentric spheres and thus the resulting global matrix possesses a block circulant structure. This structure is exploited to develop an efficient matrix decomposition algorithm for the solution of the resulting system. Further savings in the matrix decomposition algorithm are obtained by the use of fast Fourier transforms. The proposed algorithm is used, in conjunction with the method of fundamental solutions for the solution of three-dimensional inhomogeneous elliptic boundary value problems.
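The computational payoff of the circulant structure is that a circulant block is diagonalized by the discrete Fourier transform. The tiny NumPy/SciPy example below (a scalar circulant system standing in for the block circulant case treated in the paper) solves Cx = b using nothing but FFTs and a pointwise division.

import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)
C = circulant(c)                                             # circulant matrix with first column c
b = rng.standard_normal(n)

x_fft = np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)).real      # C = F^{-1} diag(fft(c)) F
print(np.linalg.norm(x_fft - np.linalg.solve(C, b)))          # agrees with a dense solve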

Journal ArticleDOI
TL;DR: An adaptive selection algorithm for the surviving symbol replica candidates (ASESS) based on the maximum reliability in maximum likelihood detection with QR decomposition and the M-algorithm and the QRM-MLD for Orthogonal Frequency Division Multiplexing (OFDM) multiple-input multiple-output (MIMO) multiplexing is proposed.
Abstract: This paper proposes an adaptive selection algorithm for the surviving symbol replica candidates (ASESS) based on the maximum reliability in maximum likelihood detection with QR decomposition and the M-algorithm (QRM-MLD) for Orthogonal Frequency Division Multiplexing (OFDM) multiple-input multiple-output (MIMO) multiplexing. In the proposed algorithm, symbol replica candidates newly added at each stage are ranked for each surviving symbol replica from the previous stage using multiple quadrant detection. Then, branch metrics are calculated only for the minimum number of symbol replica candidates with a high level of reliability using an iterative loop based on symbol ranking results. Computer simulation results show that the computational complexity of the QRM-MLD employing the proposed ASESS algorithm is reduced to approximately 1/4 and 1/1200 compared to that of the original QRM-MLD and that of the conventional MLD with squared Euclidean distance calculations for all symbol replica candidates, respectively, assuming the identical achievable average packet error rate (PER) performance in 4-by-4 MIMO multiplexing with 16QAM data modulation. The results also show that 1-Gbps throughput is achieved at the average received signal energy per bit-to-noise power spectrum density ratio (Eb/N0) per receiver antenna of approximately 9 dB using the ASESS algorithm in QRM-MLD associated with 16QAM modulation and Turbo coding with the coding rate of 8/9, assuming a 100-MHz bandwidth for a 12-path Rayleigh fading channel (root mean square (r.m.s.) delay spread of 0.26 µs and maximum Doppler frequency of 20 Hz).

Journal ArticleDOI
TL;DR: A parallel distributed solver that enables us to solve incremental dense least squares arising in some parameter estimation problems and uses a recently defined distributed packed format that handles symmetric or triangular matrices in ScaLAPACK-based implementations.
Abstract: We present a parallel distributed solver that enables us to solve incremental dense least squares arising in some parameter estimation problems. This solver is based on ScaLAPACK [8] and PBLAS [9] kernel routines. In the incremental process, the observations are collected periodically and the solver updates the solution with new observations using a QR factorization algorithm. It uses a recently defined distributed packed format [3] that handles symmetric or triangular matrices in ScaLAPACK-based implementations. We provide performance analysis on IBM pSeries 690. We also present an example of application in the area of space geodesy for gravity field computations with some experimental results.