
Showing papers on "QR decomposition" published in 2011


Journal ArticleDOI
TL;DR: This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation, and presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions.
Abstract: Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, robustness, and/or speed. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the $k$ dominant components of the singular value decomposition of an $m \times n$ matrix. (i) For a dense input matrix, randomized algorithms require $O(mn \log(k))$ floating-point operations (flops) in contrast to $O(mnk)$ for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multiprocessor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to $O(k)$ passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.

3,248 citations
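The sampling-then-compression idea in the abstract above can be sketched in a few lines. This is a minimal illustration only, not the authors' code: the function names (`randomized_range_finder`, `low_rank_approx`), the Gaussian test matrix, the oversampling parameter, and the use of modified Gram-Schmidt for orthonormalization are all assumptions made for the example.

```python
import random

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def orthonormalize(Y):
    """Modified Gram-Schmidt on the columns of Y; returns Q with orthonormal columns."""
    basis = []
    for v in transpose(Y):
        original = sum(x * x for x in v) ** 0.5
        for q in basis:
            proj = sum(x * y for x, y in zip(q, v))
            v = [x - proj * y for x, y in zip(v, q)]
        norm = sum(x * x for x in v) ** 0.5
        if original > 0.0 and norm > 1e-10 * original:  # skip dependent directions
            basis.append([x / norm for x in v])
    return transpose(basis)

def randomized_range_finder(A, k, oversample=5, seed=0):
    """Sample the range of A with a random Gaussian matrix (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(A[0])
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k + oversample)] for _ in range(n)]
    return orthonormalize(matmul(A, omega))  # basis for the sampled subspace

def low_rank_approx(A, k):
    """Return (Q, B) with A approximately equal to Q @ B, where B = Q^T A is small."""
    Q = randomized_range_finder(A, k)
    B = matmul(transpose(Q), A)
    return Q, B
```

The small matrix `B` is where a deterministic factorization (for example an SVD or QR) would be applied in a full pipeline; that second stage is omitted here.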


Book
05 Oct 2011
TL;DR: This book discusses Continuation Methods, Newton's Method and Orthogonal Decompositions Revisited, and Update Methods and their Numerical Stability.
Abstract: 1 Introduction.- 2 The Basic Principles of Continuation Methods.- 2.1 Implicitly Defined Curves.- 2.2 The Basic Concepts of PC Methods.- 2.3 The Basic Concepts of PL Methods.- 3 Newton's Method as Corrector.- 3.1 Motivation.- 3.2 The Moore-Penrose Inverse in a Special Case.- 3.3 A Newton's Step for Underdetermined Nonlinear Systems.- 3.4 Convergence Properties of Newton's Method.- 4 Solving the Linear Systems.- 4.1 Using a QR Decomposition.- 4.2 Givens Rotations for Obtaining a QR Decomposition.- 4.3 Error Analysis.- 4.4 Scaling of the Dependent Variables.- 4.5 Using LU Decompositions.- 5 Convergence of Euler-Newton-Like Methods.- 5.1 An Approximate Euler-Newton Method.- 5.2 A Convergence Theorem for PC Methods.- 6 Steplength Adaptations for the Predictor.- 6.1 Steplength Adaptation by Asymptotic Expansion.- 6.2 The Steplength Adaptation of Den Heijer & Rheinboldt.- 6.3 Steplength Strategies Involving Variable Order Predictors.- 7 Predictor-Corrector Methods Using Updating.- 7.1 Broyden's "Good" Update Formula.- 7.2 Broyden Updates Along a Curve.- 8 Detection of Bifurcation Points Along a Curve.- 8.1 Simple Bifurcation Points.- 8.2 Switching Branches Via Perturbation.- 8.3 Branching Off Via the Bifurcation Equation.- 9 Calculating Special Points of the Solution Curve.- 9.1 Introduction.- 9.2 Calculating Zero Points f(c(s)) = 0.- 9.3 Calculating Extremal Points minsf((c(s)).- 10 Large Scale Problems.- 10.1 Introduction.- 10.2 General Large Scale Solvers.- 10.3 Nonlinear Conjugate Gradient Methods as Correctors.- 11 Numerically Implementable Existence Proofs.- 11.1 Preliminary Remarks.- 11.2 An Example of an Implementable Existence Theorem.- 11.3 Several Implementations for Obtaining Brouwer Fixed Points.- 11.4 Global Newton and Global Homotopy Methods.- 11.5 Multiple Solutions.- 11.6 Polynomial Systems.- 11.7 Nonlinear Complementarity.- 11.8 Critical Points and Continuation Methods.- 12 PL Continuation Methods.- 12.1 Introduction.- 12.2 PL Approximations.- 
12.3 A PL Algorithm for Tracing H(u) = 0.- 12.4 Numerical Implementation of a PL Continuation Algorithm.- 12.5 Integer Labeling.- 12.6 Truncation Errors.- 13 PL Homotopy Algorithms.- 13.1 Set-Valued Maps.- 13.2 Merrill's Restart Algorithm.- 13.3 Some Triangulations and their Implementations.- 13.4 The Homotopy Algorithm of Eaves & Saigal.- 13.5 Mixing PL and Newton Steps.- 13.6 Automatic Pivots for the Eaves-Saigal Algorithm.- 14 General PL Algorithms on PL Manifolds.- 14.1 PL Manifolds.- 14.2 Orientation and Index.- 14.3 Lemke's Algorithm for the Linear Complementarity Problem.- 14.4 Variable Dimension Algorithms.- 14.5 Exploiting Special Structure.- 15 Approximating Implicitly Defined Manifolds.- 15.1 Introduction.- 15.2 Newton's Method and Orthogonal Decompositions Revisited.- 15.3 The Moving Frame Algorithm.- 15.4 Approximating Manifolds by PL Methods.- 15.5 Approximation Estimates.- 16 Update Methods and their Numerical Stability.- 16.1 Introduction.- 16.2 Updates Using the Sherman-Morrison Formula.- 16.3 QR Factorization.- 16.4 LU Factorization.- P1 A Simple PC Continuation Method.- P2 A PL Homotopy Method.- P3 A Simple Euler-Newton Update Method.- P4 A Continuation Algorithm for Handling Bifurcation.- P5 A PL Surface Generator.- P6 SCOUT - Simplicial Continuation Utilities.- P6.1 Introduction.- P6.2 Computational Algorithms.- P6.3 Interactive Techniques.- P6.4 Commands.- P6.5 Example: Periodic Solutions to a Differential Delay Equation.- Index and Notation.

1,143 citations


01 Jan 2011
TL;DR: In this article, the authors present a modular framework for constructing randomized algorithms that compute partial matrix decompositions, which use random sampling to identify a subspace that captures most of the action of a matrix.
Abstract: Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, robustness, and/or speed. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast to O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multiprocessor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.

494 citations


Journal ArticleDOI
TL;DR: SuiteSparseQR is a sparse QR factorization package based on the multifrontal method that obtains a substantial fraction of the theoretical peak performance of a multicore computer.
Abstract: SuiteSparseQR is a sparse QR factorization package based on the multifrontal method. Within each frontal matrix, LAPACK and the multithreaded BLAS enable the method to obtain high performance on multicore architectures. Parallelism across different frontal matrices is handled with Intel's Threading Building Blocks library. The symbolic analysis and ordering phase pre-eliminates singletons by permuting the input matrix A into the form [R11 R12; 0 A22], where R11 is upper triangular with diagonal entries above a given tolerance. Next, the fill-reducing ordering, column elimination tree, and frontal matrix structures are found without requiring the formation of the pattern of A^T A. Approximate rank-detection is performed within each frontal matrix using Heath's method. While Heath's method is not always exact, it has the advantage of not requiring column pivoting and thus does not interfere with the fill-reducing ordering. For sufficiently large problems, the resulting sparse QR factorization obtains a substantial fraction of the theoretical peak performance of a multicore computer.

241 citations


Journal ArticleDOI
TL;DR: The communication lower bounds proved by Hong and Kung for sequential matrix multiplication, and extended by Irony, Toledo, and Tiskin to the parallel case, are generalized to a much wider variety of linear algebra algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, and QR factorization.
Abstract: In 1981 Hong and Kung proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo, and Tiskin gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic_operations/√M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices and for sequential or para...

212 citations


Journal ArticleDOI
TL;DR: A QR decomposition scheme by cascading one complex-value and one real-value Givens rotation stages is proposed, which can save 44% hardware complexity and achieves the highest throughput with high efficiency.
Abstract: This paper presents a VLSI architecture of QR decomposition for 4×4 MIMO-OFDM systems. A real-value decomposed MIMO system model is handled and thus the channel matrix to be processed is extended to the size of 8×8. Instead of direct factorization, a QR decomposition scheme by cascading one complex-value and one real-value Givens rotation stages is proposed, which can save 44% hardware complexity. Besides, the requirement of skewed inputs in the conventional QR-decomposition systolic array is eliminated and 36% of delay elements are removed. The real-value Givens rotation stage is also constructed in a form of a stacked triangular systolic array to match with the throughput of the complex-value one. Hardware sharing is considered to enhance the utilization. The proposed design is implemented in 0.18-μm CMOS technology with 152K gates. From measurement, the maximum operating frequency is 100 MHz. It generates QR decomposition results every four clock cycles and accomplishes continuous projection every clock cycle to support MIMO detection up to 2.4 Gb/s. The measured power consumption is 318.6 mW and 219.6 mW for QR decomposition and projection, respectively, at the highest operating frequency. From the comparison, our proposed design achieves the highest throughput with high efficiency.

116 citations
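For readers unfamiliar with the building block named in the abstract above, the following is a minimal floating-point sketch of QR decomposition by real-valued Givens rotations. It is a textbook software version written for illustration; it does not model the cascaded complex/real stages, the systolic array, or any other aspect of the VLSI design, and the function name `givens_qr` is an assumption.

```python
import math

def givens_qr(A):
    """QR by Givens rotations. Returns (Q, R) with A = Q R, Q orthogonal (m x m)
    and R upper triangular (m x n)."""
    m, n = len(A), len(A[0])
    R = [list(map(float, row)) for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j below the diagonal, working from the bottom row upward.
        for i in range(m - 1, j, -1):
            a, b = R[i - 1][j], R[i][j]
            r = math.hypot(a, b)
            if r == 0.0:
                continue  # nothing to eliminate
            c, s = a / r, b / r
            # Apply the rotation to rows i-1 and i of R.
            for k in range(n):
                t1, t2 = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * t1 + s * t2
                R[i][k] = -s * t1 + c * t2
            # Accumulate the transposed rotation into columns i-1 and i of Q.
            for k in range(m):
                t1, t2 = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * t1 + s * t2
                Q[k][i] = -s * t1 + c * t2
    return Q, R
```

Each rotation touches only two rows, which is the property that makes Givens-based schemes attractive for array-style hardware.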


Proceedings ArticleDOI
16 May 2011
TL;DR: An implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU) is described, which shows that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices.
Abstract: We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs by up to 17x for tall-skinny matrices and Intel's Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 30x faster than if we use Intel's Math Kernel Library (MKL) singular value decomposition routine on a multicore CPU.

109 citations


Journal ArticleDOI
TL;DR: A new QR decomposition algorithm which overcomes critical limitations in other QR algorithms that prohibit their application to MIMO systems and is used to create the first reported architecture capable of supporting real-time 802.11n operation.
Abstract: Real-time matrix inversion is a key enabling technology in multiple-input multiple-output (MIMO) communications systems, such as 802.11n. To date, however, no matrix inversion implementation has been devised which supports real-time operation for these standards. In this paper, we overcome this barrier by presenting a novel matrix inversion algorithm which is ideally suited to high performance floating-point implementation. We show how the resulting architecture offers fundamentally higher performance than currently published matrix inversion approaches and we use it to create the first reported architecture capable of supporting real-time 802.11n operation. Specifically, we present a matrix inversion approach based on modified squared Givens rotations (MSGR). This is a new QR decomposition algorithm which overcomes critical limitations in other QR algorithms that prohibit their application to MIMO systems. In addition, we present a novel modification that further reduces the complexity of MSGR by almost 20%. This enables real-time implementation with negligible reduction in the accuracy of the inversion operation, or the BER of a MIMO receiver based on this.

89 citations


Journal ArticleDOI
TL;DR: Using the concept of Geometric Weakly Admissible Meshes (see §2 below) together with an algorithm based on the classical QR factorization of matrices, the authors compute efficient points for discrete multivariate least squares approximation and Lagrange interpolation.
Abstract: Using the concept of Geometric Weakly Admissible Meshes (see §2 below) together with an algorithm based on the classical QR factorization of matrices, we compute efficient points for discrete multivariate least squares approximation and Lagrange interpolation.

82 citations


Proceedings ArticleDOI
08 Jun 2011
TL;DR: An implementation of the tall and skinny QR (TSQR) factorization in the MapReduce framework is presented, and computational results for nearly terabyte-sized datasets are provided.
Abstract: The QR factorization is one of the most important and useful matrix factorizations in scientific computing. A recent communication-avoiding version of the QR factorization trades flops for messages and is ideal for MapReduce, where computationally intensive processes operate locally on subsets of the data. We present an implementation of the tall and skinny QR (TSQR) factorization in the MapReduce framework, and we provide computational results for nearly terabyte-sized datasets. These tasks run in just a few minutes under a variety of parameter choices.

62 citations
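The tall-and-skinny QR idea described above (factor row blocks locally, then combine the small R factors) can be sketched as follows. This is an in-memory illustration under stated assumptions, not the MapReduce implementation from the paper: the helper names `r_factor` and `tsqr_r` are hypothetical, only the R factor is computed, modified Gram-Schmidt stands in for whatever local QR a real system would use, and every row block is assumed to have full column rank with at least as many rows as columns.

```python
def r_factor(A):
    """R factor of a thin QR via modified Gram-Schmidt (assumes full column rank)."""
    n = len(A[0])
    cols = [list(map(float, c)) for c in zip(*A)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            # cols[i] is already normalized at this point
            R[i][j] = sum(x * y for x, y in zip(cols[i], cols[j]))
            cols[j] = [y - R[i][j] * x for x, y in zip(cols[i], cols[j])]
        R[j][j] = sum(x * x for x in cols[j]) ** 0.5
        cols[j] = [x / R[j][j] for x in cols[j]]
    return R

def tsqr_r(A, block_rows):
    """Two-level TSQR: local factorizations ("map"), then one combine ("reduce")."""
    # "Map": each row block is factored independently of the others.
    local_rs = [r_factor(A[s:s + block_rows]) for s in range(0, len(A), block_rows)]
    # "Reduce": stack the small n x n factors and factor the stack once more.
    stacked = [row for R in local_rs for row in R]
    return r_factor(stacked)
```

Because only n-by-n triangular factors cross block boundaries, the data exchanged between stages is independent of the number of rows, which is the property the abstract refers to as trading flops for messages.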


Journal ArticleDOI
TL;DR: A novel pilot-aided iterative algorithm is developed for MIMO-OFDM systems operating in fast time-varying environment and it is shown that only one iteration is sufficient to approach the performance of the ideal case for which the knowledge of the channel response and CFO is available.
Abstract: In this paper, a novel pilot-aided algorithm is developed for multiple-input-multiple-output (MIMO) orthogonal frequency-division-multiplexing (OFDM) systems operating in a fast time-varying environment. The algorithm has been designed to work with both the parametric L -path channel model (with known path delays) and the equivalent discrete-time channel model to jointly estimate the multipath Rayleigh channel complex amplitude (CA) and the carrier frequency offset (CFO). Each CA time variation within one OFDM symbol is approximated by a basis expansion model representation. An autoregressive model is built for the parameters to be estimated. The algorithm performs estimation using extended Kalman filtering. The channel matrix is thus easily computed, and the data symbol is estimated without intercarrier interference (ICI) when the channel matrix is QR-decomposed. It is shown that our algorithm is far more robust to high speed than the conventional algorithm, and the performance approaches that of the ideal case for which the channel response and CFO are known.

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the watermarked images have good visual quality and this scheme is better than the existing techniques, especially when the image is attacked by cropping, noise pollution and so on.
Abstract: In order to protect copyright of digital images, a new robust digital image watermarking algorithm based on chaotic system and QR factorization was proposed. The host images were firstly divided into blocks with same size, then QR factorization was performed on each block. Pseudorandom circular chain (PCC) generated by logistic mapping (LM) was applied to select the embedding blocks for enhancing the security of the scheme. The first column coefficients in Q matrix of chosen blocks were modified to embed watermarks without causing noticeable artifacts. Watermark extraction procedure was performed without the original cover image. The experimental results demonstrate that the watermarked images have good visual quality and this scheme is better than the existing techniques, especially when the image is attacked by cropping, noise pollution and so on. Analysis and discussion on robustness and security issues were also presented.

Journal ArticleDOI
TL;DR: In this article, a data-driven adaptive model-based predictive controller (MBPC) with input constraints is proposed, which employs subspace identification technique and a singular value decomposition (SVD)-based optimisation strategy to formulate the control algorithm and incorporate the input constraints.
Abstract: This study is concerned with the development of a new data-driven adaptive model-based predictive controller (MBPC) with input constraints. The proposed methods employ subspace identification technique and a singular value decomposition (SVD)-based optimisation strategy to formulate the control algorithm and incorporate the input constraints. Both direct adaptive model-based predictive controller (DAMBPC) and indirect adaptive model-based predictive controller (IAMBPC) are considered. In DAMBPC, the direct identification of controller parameters is desired to reduce the design effort and computational load while the IAMBPC involves a two-stage process of model identification and controller design. The former method only requires a single QR decomposition for obtaining the controller parameters and uses a receding horizon approach to process input/output data for the identification. A suboptimal SVD-based optimisation technique is proposed to incorporate the input constraints. The proposed techniques are implemented and tested on a fourth order non-linear model of a wastewater system. Simulation results are presented to compare the direct and indirect adaptive methods and to demonstrate the performance of the proposed algorithms.

Journal ArticleDOI
TL;DR: A modified algorithm that possesses a scalable property to save the power consumption for interpolation-based QR decomposition in the variable-rank MIMO scheme is proposed.
Abstract: This paper presents a modified interpolation-based QR decomposition algorithm for the grouped-ordering multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems. Based on the original research that integrates the calculations of the frequency-domain channel estimation and the QR decomposition for the MIMO-OFDM system, this study proposes a modified algorithm that possesses a scalable property to save the power consumption for interpolation-based QR decomposition in the variable-rank MIMO scheme. Furthermore, we also develop the general equations and a timing scheduling method for the hardware design of the proposed QR decomposition processor for the higher-dimension MIMO system. Based on the proposed algorithm, a configurable interpolation-based QR decomposition and channel estimation processor was designed and implemented using a 90-nm one-poly nine-metal CMOS technology. The processor supports 2 × 2, 2 × 4 and 4 × 4 QR-based MIMO detection for the 3GPP-LTE MIMO-OFDM system and achieves the throughput of 35.16 MQRD/s at its maximum clock rate 140.65 MHz.

Proceedings ArticleDOI
30 Nov 2011
TL;DR: The evaluation results show that the proposed systolic array satisfies 99.9% correct 4×4 QR decomposition for the 2^-13 accuracy requirement when the word length of the data path is larger than 25 bits.
Abstract: This paper presents a parallel architecture of a QR decomposition systolic array based on the Givens rotations algorithm on FPGA. The proposed architecture adopts a direct mapping by 21 fixed-point CORDIC-based process units that can compute the QR decomposition for a 4×4 real matrix. In order to achieve a comprehensive resource and performance evaluation, the computational error analysis, the resources utilized, and the speed achieved on a Virtex5 XC5VTX150T FPGA are evaluated with different precisions of the intermediate word lengths. The evaluation results show that the proposed systolic array 1) satisfies 99.9% correct 4×4 QR decomposition for the 2^-13 accuracy requirement when the word length of the data path is larger than 25 bits, and 2) occupies about 2,810 (13%) slices and achieves about 2.06 M updates/sec when running at the maximum frequency of 111 MHz.
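To make the CORDIC-based processing unit mentioned above more concrete, here is a minimal sketch of CORDIC in vectoring mode, the operation that finds the Givens rotation zeroing one component of a pair. It emulates the shift-and-add micro-rotations in floating point for readability; the fixed-point word lengths, the systolic mapping, and the function name `cordic_vectoring` are not taken from the paper and are assumptions of this example.

```python
import math

def cordic_vectoring(x, y, iterations=16):
    """Rotate (x, y) onto the positive x-axis using CORDIC micro-rotations.

    Assumes x > 0. Returns (magnitude, angle), where angle is the direction of
    the input vector in radians.
    """
    angle = 0.0
    gain = 1.0
    for i in range(iterations):
        d = -1.0 if y > 0 else 1.0  # pick the direction that drives y toward zero
        shift = 2.0 ** -i           # a bit shift in fixed-point hardware
        x, y = x - d * y * shift, y + d * x * shift
        angle -= d * math.atan(shift)             # table lookup in hardware
        gain *= math.sqrt(1.0 + shift * shift)    # accumulated CORDIC scale factor
    return x / gain, angle
```

More iterations give a more accurate rotation at the cost of latency, which is the same accuracy-versus-resource trade-off the abstract evaluates through word length.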

Journal ArticleDOI
TL;DR: The results show that very coarse approximations are sufficient for reasonable positioning accuracy and the presented method reduces the computational complexity significantly and is highly suited for hardware implementation.
Abstract: . The efficient implementation of positioning algorithms is investigated for Global Positioning System (GPS). In order to do the positioning, the pseudoranges between the receiver and the satellites are required. The most commonly used algorithm for position computation from pseudoranges is non-linear Least Squares (LS) method. Linearization is done to convert the non-linear system of equations into an iterative procedure, which requires the solution of a linear system of equations in each iteration, i.e. linear LS method is applied iteratively. CORDIC-based approximate rotations are used while computing the QR decomposition for solving the LS problem in each iteration. By choosing accuracy of the approximation, e.g. with a chosen number of optimal CORDIC angles per rotation, the LS computation can be simplified. The accuracy of the positioning results is compared for various numbers of required iterations and various approximation accuracies using real GPS data. The results show that very coarse approximations are sufficient for reasonable positioning accuracy. Therefore, the presented method reduces the computational complexity significantly and is highly suited for hardware implementation.

Journal ArticleDOI
TL;DR: This paper considers the preliminary unitary similarity transformation to Hessenberg form, a proof of uniqueness of this reduction, an extension of the CMV-decomposition to a double Hessenberg factorization, and an explicit and implicit $QR$-type algorithm.
Abstract: The $QR$-algorithm is a renowned method for computing all eigenvalues of an arbitrary matrix. A preliminary unitary similarity transformation to Hessenberg form is indispensable for keeping the computational complexity of the subsequent $QR$-steps under control. When restraining computing time is the vital issue, we observe that the prominent role played by the Hessenberg matrix is sufficient but perhaps not necessary to fulfill this goal. In this paper, a whole new family of matrices, sharing the major qualities of Hessenberg matrices, will be put forward. This gives rise to the development of innovative implicit $QR$-type algorithms, pursuing rotations instead of bulges. The key idea is to benefit from the $QR$-factorization of the matrices involved. The prescribed order of rotations in the decomposition of the $Q$-factor uniquely characterizes several matrix types such as Hessenberg, inverse Hessenberg, and $CMV$ matrices. Loosening the fixed ordering of these rotations provides us the class of matrices under consideration. Establishing a new implicit $QR$-type algorithm for these matrices requires a generalization of diverse well-established concepts. We consider the preliminary unitary similarity transformation, a proof of uniqueness of this reduction, an extension of the $CMV$-decomposition to a double Hessenberg factorization, and an explicit and implicit $QR$-type algorithm. A detailed complexity analysis illustrates the competitiveness of the novel method with the traditional Hessenberg approach. The numerical experiments show comparable accuracy for a wide variety of matrix types, but disclose an intriguing difference between the average number of iterations before deflation can be applied.
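As background for the explicit $QR$-type iteration the abstract refers to, the textbook unshifted $QR$-algorithm can be sketched as below. This is only the classical baseline, assuming a small nonsingular symmetric matrix: it has no Hessenberg reduction, no shifts, no deflation, and none of the rotation-chasing machinery the paper develops. The helper names (`qr_mgs`, `qr_iteration`) are hypothetical.

```python
def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def qr_mgs(A):
    """QR of a square nonsingular matrix via modified Gram-Schmidt."""
    n = len(A)
    cols = [list(map(float, c)) for c in zip(*A)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            R[i][j] = sum(x * y for x, y in zip(cols[i], cols[j]))
            cols[j] = [y - R[i][j] * x for x, y in zip(cols[i], cols[j])]
        R[j][j] = sum(x * x for x in cols[j]) ** 0.5
        cols[j] = [x / R[j][j] for x in cols[j]]
    Q = [list(row) for row in zip(*cols)]  # columns back to row-major storage
    return Q, R

def qr_iteration(A, steps=100):
    """Repeat A <- R Q, a similarity transform, and read eigenvalues off the diagonal."""
    A = [list(map(float, row)) for row in A]
    for _ in range(steps):
        Q, R = qr_mgs(A)
        A = matmul(R, Q)  # R Q = Q^T (Q R) Q, so the eigenvalues are preserved
    return sorted(A[i][i] for i in range(len(A)))
```

For a dense matrix each step costs a full factorization, which is the cost that the preliminary reduction to Hessenberg form (or to the wider family studied in the paper) is meant to bring under control.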

Book ChapterDOI
11 Sep 2011
TL;DR: This work investigates randomized algorithms based on gossiping for the distributed computation of the QR factorization and illustrates that the randomized approaches are well suited for distributed systems with arbitrary topology and potentially unreliable communication, where approaches with fixed communication schedules have major drawbacks.
Abstract: Most parallel algorithms for matrix computations assume a static network with reliable communication and thus use fixed communication schedules. However, in situations where computer systems may change dynamically, in particular, when they have unreliable components, algorithms with randomized communication schedule may be an interesting alternative. We investigate randomized algorithms based on gossiping for the distributed computation of the QR factorization. The analyses of numerical accuracy showed that known numerical properties of classical sequential and parallel QR decomposition algorithms are preserved. Moreover, we illustrate that the randomized approaches are well suited for distributed systems with arbitrary topology and potentially unreliable communication, where approaches with fixed communication schedules have major drawbacks. The communication overhead compared to the optimal parallel QR decomposition algorithm (CAQR) is analyzed. The randomized algorithms have a much higher potential for trading off numerical accuracy against performance because their accuracy is proportional to the amount of communication invested.

Journal ArticleDOI
TL;DR: An in-depth complexity analysis shows that the proposed algorithms, for a sufficiently large number of data-carrying tones and sufficiently small channel order, provably exhibit significantly smaller complexity than brute-force per-tone QR decomposition.
Abstract: Detection algorithms for multiple-input multiple-output (MIMO) wireless systems based on orthogonal frequency-division multiplexing (OFDM) typically require the computation of a QR decomposition for each of the data-carrying OFDM tones. The resulting computational complexity will, in general, be significant. Motivated by the fact that the channel matrices arising in MIMO-OFDM systems result from oversampling of a polynomial matrix, we formulate interpolation-based QR decomposition algorithms. An in-depth complexity analysis, based on a metric relevant for very large scale integration (VLSI) implementations, shows that the proposed algorithms, for a sufficiently large number of data-carrying tones and sufficiently small channel order, provably exhibit significantly smaller complexity than brute-force per-tone QR decomposition.

Journal ArticleDOI
TL;DR: An efficient hierarchical (H-) LU decomposition algorithm based on the H-matrix techniques is proposed to handle the problem of convergence rate and is very robust for the analysis of various planar layered structures.
Abstract: The matrix decomposition algorithm (MDA) provides an efficient matrix-vector product for the iterative solution of the integral equation (IE) by a blockwise compression of the impedance matrix. The MDA with a singular value decomposition (SVD) recompression scheme, i.e., so-called MDA-SVD method, shows strong ability for the analysis of planar layered structures. However, iterative solution faces the problem of convergence rate. An efficient hierarchical (H-) LU decomposition algorithm based on the H-matrix techniques is proposed to handle this problem. Exploiting the data-sparse representation of the MDA-SVD compressed impedance matrix, H -LU decomposition can be efficiently implemented by H-matrix arithmetic. H-matrix techniques provide a flexible way to control the accuracy of the approximate H-LU-factors. H-LU decomposition with low accuracy can be used as an efficient preconditioner for the iterative solver due to its low computational cost, while H-LU decomposition with high accuracy can be used as a direct solver for dealing with multiple right-hand-side (RHS) vector problems particularly. Numerical examples demonstrate that the proposed method is very robust for the analysis of various planar layered structures.

Posted Content
TL;DR: Both FIBONACCI and GREEDY are shown to be asymptotically optimal for all matrices of size p = q^2 f(q), where f is any function such that lim_{+∞} f = 0.
Abstract: This work revisits existing algorithms for the QR factorization of rectangular matrices composed of p-by-q tiles, where p >= q. Within this framework, we study the critical paths and performance of algorithms such as Sameh and Kuck, Modi and Clarke, Greedy, and those found within PLASMA. Although neither Modi and Clarke nor Greedy is optimal, both are shown to be asymptotically optimal for all matrices of size p = q^2 f(q), where f is any function such that \lim_{+\infty} f= 0. This novel and important complexity result applies to all matrices where p and q are proportional, p = \lambda q, with \lambda >= 1, thereby encompassing many important situations in practice (least squares). We provide an extensive set of experiments that show the superiority of the new algorithms for tall matrices.

Journal ArticleDOI
TL;DR: This paper constructs a simultaneous decomposition of a matrix triplet (A, B, C), where A=±A*.
Abstract: Research on ranks of matrix expressions has posed a number of challenging questions, one of which concerns simultaneous decompositions of several given matrices. In this paper, we construct a simultaneous decomposition of a matrix triplet (A, B, C), where A=±A*. Through the simultaneous matrix decomposition, we derive a canonical form for the matrix expression A−BXB*−CYC* and then solve two conjectures on the maximal and minimal possible ranks of A−BXB*−CYC* with respect to X=±X* and Y=±Y*. As an application, we derive a necessary and sufficient condition for the matrix equation BXB* + CYC*=A to have a pair of Hermitian solutions, and then give the general Hermitian solutions to the matrix equation. Copyright © 2010 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this article, a nonlinear reduced-order model for fluid-structure interaction problems is investigated for unsteady compressible flows excited by the rigid body motion of a structure.

Proceedings ArticleDOI
06 Nov 2011
TL;DR: A new method is developed for reducing the computational time and improving the numerical stability of polynomial solvers based on the action matrix method, using a set of algebraic conditions that reduce the size of the elimination template (polynomial coefficient matrix), which leads to faster LU or QR decomposition.
Abstract: In recent years polynomial solvers based on algebraic geometry techniques, and specifically the action matrix method, have become popular for solving minimal problems in computer vision. In this paper we develop a new method for reducing the computational time and improving numerical stability of algorithms using this method. To achieve this, we propose and prove a set of algebraic conditions which allow us to reduce the size of the elimination template (polynomial coefficient matrix), which leads to faster LU or QR decomposition. Our technique is generic and has potential to improve performance of many solvers that use the action matrix method. We demonstrate the approach on specific examples, including an image stitching algorithm where computation time is halved and single precision arithmetic can be used.

Journal ArticleDOI
TL;DR: It is proved that under a noise-free environment, the code design proposed in this paper can guarantee that the transmitted signals and the channel coefficients are uniquely identified, and that under a complex Gaussian noise environment in which Pth-order and Qth-order statistics are available, the channel coefficients can still be uniquely identified.
Abstract: In this paper, the systematic design of a space-time block code is considered for a wireless communication system with multiple transmitter-receiver antennas and flat fading, in which channel state information is completely unknown. From the viewpoint of blind signal processing, a necessary and sufficient condition is given for the unique identification of the multi-input multi-output flat fading channel and transmitted signal. Then, some novel unique factorizations for a pair of coprime P-ary and Q-ary phase shift keying (PSK) constellations are established. With this and currently available coherent space-time block code designs, a method is developed to systematically construct full diversity blind nonunitary space-time block codes as well as unitary codes by just performing the QR decomposition of the nonunitary codes. It is proved that under a noise-free environment, the code design proposed in this paper can guarantee that the transmitted signals and the channel coefficients are uniquely identified, and that under a complex Gaussian noise environment in which Pth-order and Qth-order statistics (P and Q are coprime) of the received signals are available, the channel coefficients can still be uniquely identified. In addition, a closed-form solution to determine the channel coefficients is obtained.

Proceedings ArticleDOI
16 May 2011
TL;DR: This Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization.
Abstract: Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, "communication" includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ("normalization"). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5 -- 20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
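
The reduction idea behind TSQR can be illustrated with a small sketch: factor each block of rows independently, stack the resulting small R factors, and factor the stack once more. The code below is a minimal single-level illustration, not the Trilinos implementation. The helper names `qr_mgs` and `tsqr_r` are hypothetical, and the local factorizations use modified Gram-Schmidt rather than Householder reflections purely for brevity, so the sketch does not carry the accuracy properties claimed in the abstract. It also assumes every row block has at least as many rows as columns and full column rank.

```python
import math

def qr_mgs(A):
    """Thin QR by modified Gram-Schmidt (stand-in for Householder QR).

    A is a list of rows (m x n, m >= n, full column rank).
    Returns (Q, R) with R upper triangular and a positive diagonal.
    """
    m, n = len(A), len(A[0])
    Q = [list(map(float, row)) for row in A]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = math.sqrt(sum(Q[i][j] ** 2 for i in range(m)))
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def tsqr_r(A, num_blocks):
    """R factor of a tall, skinny matrix via one TSQR reduction level."""
    m = len(A)
    step = -(-m // num_blocks)  # ceiling division: rows per block
    stacked = []
    for start in range(0, m, step):
        # Each block could be factored on a different processor.
        _, R_local = qr_mgs(A[start:start + step])
        stacked.extend(R_local)
    # Only the small stacked R factors need to be combined.
    _, R = qr_mgs(stacked)
    return R
```

In a parallel setting only the small n-by-n local factors would be exchanged, which is where the reduction in messages comes from; a real implementation would combine them along a tree rather than in one step.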

Posted Content
TL;DR: In this paper, a QR factorization algorithm for massively parallel platforms combining parallel distributed multi-core nodes is presented, which enables good data locality for the sequential kernels executed by the cores, low number of messages in a parallel distributed setting (small latency term), and fine granularity.
Abstract: This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms, which naturally enable good data locality for the sequential kernels executed by the cores (high sequential performance), a low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism).

Patent
20 Apr 2011
TL;DR: In this paper, a linear precoding method in a multiuser multi-input and multi-output (MIMO) system is proposed, which comprises the following steps: determining the interference channel transmission matrix in the system of each user according to the channel estimation results, wherein, k is user index and k is in the range from 1 to K, and K is the user number served by the system base station simultaneously in the same band range.
Abstract: The invention discloses a linear precoding method in a multiuser multi-input and multi-output (MIMO) system, which comprises the following steps: determining the interference channel transmission matrix $H_k$ of each user in the system according to the channel estimation results, wherein k is the user index, k is in the range from 1 to K, and K is the number of users served simultaneously by the system base station in the same band; carrying out QR decomposition of the conjugate transpose of any user's interference channel transmission matrix $H_k$, and forming that user's linear precoding matrix $T_k$ according to the QR decomposition result; and carrying out linear precoding of each user's transmitted signal $s_k$ using the formed linear precoding matrix $T_k$.
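
The abstract does not say how the precoding matrix is formed from the QR result. One common construction, assumed here and not taken from the patent, is to use the trailing columns of the full Q factor of the conjugate-transposed interference matrix: they span the null space of the interference matrix, so a signal precoded with them causes no interference to the other users. The sketch below obtains those columns by Gram-Schmidt orthogonalization of the conjugate-transposed matrix followed by the standard basis, which is a stand-in for a full QR routine. The names `null_space_precoder` and `H_int` are hypothetical.

```python
import math

def _inner(u, v):
    """Complex inner product <u, v> = sum(conj(u_i) * v_i)."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def _residual(v, basis):
    """Part of v orthogonal to the orthonormal vectors in basis."""
    v = list(v)
    for _ in range(2):  # second pass guards against rounding error
        for q in basis:
            c = _inner(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
    return v

def null_space_precoder(H_int, tol=1e-10):
    """Orthonormal vectors t with H_int @ t ~ 0 (assumed construction).

    H_int is a list of rows (r x n_t) of complex numbers. The result is
    a list of column vectors, each of length n_t.
    """
    n_t = len(H_int[0])
    # Columns of the conjugate transpose are the conjugated rows.
    columns = [[complex(z).conjugate() for z in row] for row in H_int]
    # Standard basis vectors complete the orthonormal set to C^n_t.
    unit = [[1.0 + 0j if i == j else 0j for i in range(n_t)]
            for j in range(n_t)]
    basis, precoder = [], []
    for idx, v in enumerate(columns + unit):
        r = _residual(v, basis)
        norm = math.sqrt(sum(abs(x) ** 2 for x in r))
        if norm <= tol:
            continue  # v already lies in the span of the basis
        q = [x / norm for x in r]
        basis.append(q)
        if idx >= len(columns):
            precoder.append(q)  # orthogonal to every column of H_int^H
    return precoder
```

For an interference matrix of rank r with n_t transmit antennas, this returns n_t - r precoding vectors, so the construction only works when the base station has more antennas than the interfering receive dimensions.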

Patent
22 Jun 2011
TL;DR: In this paper, a multi-user multiple input multiple output (MU-MIMO) transmission method in a wireless communication system, a base station and a user terminal is presented.
Abstract: The invention discloses a multi-user multiple input multiple output (MU-MIMO) transmission method in a wireless communication system, a base station and a user terminal. The method comprises the following steps: the base station receives sounding reference signals (SRS) from N users to perform channel estimation and acquires downlink channel information according to the channel estimation result and the channel reciprocity of the system, wherein N is greater than 1; the base station performs QR decomposition on the downlink channel information, acquires a multi-user beamforming (MU-BF) matrix P(i) of the ith user from the Q matrix obtained through the decomposition, and further acquires a downlink single-user beamforming (SU-BF) matrix V(i) of the ith user, wherein i = 1, ..., N; and the base station performs beamforming processing on the transmit data of the ith user according to the MU-BF matrix P(i) and the SU-BF matrix V(i). The method and equipment acquire beamforming matrices for uplink and downlink MU-MIMO transmission by means of the channel reciprocity of the system and the QR decomposition, and the MU-MIMO transmission performance can be improved.

Journal ArticleDOI
TL;DR: A partitioned algorithm for reducing a symmetric matrix to tridiagonal form with partial pivoting, computing a factorization $PAP^T = LTL^T$; the algorithm is componentwise backward stable, and the implementation solves linear systems of equations using the computed factorization.
Abstract: We present a partitioned algorithm for reducing a symmetric matrix to a tridiagonal form, with partial pivoting. That is, the algorithm computes a factorization $PAP^T = LTL^T$, where P is a permutation matrix, L is lower triangular with a unit diagonal and entries' magnitudes bounded by 1, and T is symmetric and tridiagonal. The algorithm is based on the basic (nonpartitioned) methods of Parlett and Reid and of Aasen. We show that our factorization algorithm is componentwise backward stable (provided that the growth factor is not too large), with a similar behavior to that of Aasen's basic algorithm. Our implementation also computes the QR factorization of T and solves linear systems of equations using the computed factorization. The partitioning allows our algorithm to exploit modern computer architectures (in particular, cache memories and high-performance BLAS libraries). Experimental results demonstrate that our algorithms achieve approximately the same level of performance as the partitioned Bunch-Kaufman factor and solve routines in LAPACK.
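
Once a symmetric matrix has been reduced to a tridiagonal T, a linear system can be solved with two triangular solves involving L plus one solve with T. The sketch below covers only that middle step: it solves T z = y through a QR factorization of T built from Givens rotations. It illustrates the general technique under stated assumptions, not the authors' implementation. The function name `solve_tridiagonal_qr` is hypothetical, T is stored densely for readability, and T is assumed to be nonsingular.

```python
import math

def solve_tridiagonal_qr(T, y):
    """Solve T z = y for a nonsingular tridiagonal T (list of rows)."""
    n = len(T)
    R = [list(map(float, row)) for row in T]
    rhs = [float(v) for v in y]
    for k in range(n - 1):
        a, b = R[k][k], R[k + 1][k]
        r = math.hypot(a, b)
        if r == 0.0:
            continue  # nothing to eliminate in this column
        c, s = a / r, b / r
        # Rotate rows k and k+1 to zero the subdiagonal entry; only
        # columns k..k+2 can be nonzero in these two rows.
        for j in range(k, min(k + 3, n)):
            u, v = R[k][j], R[k + 1][j]
            R[k][j] = c * u + s * v
            R[k + 1][j] = -s * u + c * v
        # Apply the same rotation to the right-hand side (Q^T y).
        u, v = rhs[k], rhs[k + 1]
        rhs[k] = c * u + s * v
        rhs[k + 1] = -s * u + c * v
    # Back substitution: R is upper triangular with two superdiagonals.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = rhs[i] - sum(R[i][j] * z[j]
                           for j in range(i + 1, min(i + 3, n)))
        z[i] = acc / R[i][i]
    return z
```

Each rotation touches at most three columns, so the work grows linearly with the matrix dimension when T is kept in banded storage.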