Showing papers on "QR decomposition" published in 2014


Proceedings ArticleDOI
Moritz Hardt
18 Oct 2014
TL;DR: A new algorithm based on alternating minimization is given that provably recovers an unknown low-rank matrix from a random subsample of its entries under a standard incoherence assumption and gives the strongest sample bounds among all subquadratic-time algorithms that the authors are aware of.
Abstract: Alternating minimization is a widely used and empirically successful heuristic for matrix completion and related low-rank optimization problems. Theoretical guarantees for alternating minimization have been hard to come by and are still poorly understood. This is in part because the heuristic is iterative and non-convex in nature. We give a new algorithm based on alternating minimization that provably recovers an unknown low-rank matrix from a random subsample of its entries under a standard incoherence assumption. Our results reduce the sample size requirements of the alternating minimization approach by at least a quartic factor in the rank and the condition number of the unknown matrix. These improvements apply even if the matrix is only close to low-rank in the Frobenius norm. Our algorithm runs in nearly linear time in the dimension of the matrix and, in a broad range of parameters, gives the strongest sample bounds among all subquadratic time algorithms that we are aware of. Underlying our work is a new robust convergence analysis of the well-known Power Method for computing the dominant singular vectors of a matrix. This viewpoint leads to a conceptually simple understanding of alternating minimization. In addition, we contribute a new technique for controlling the coherence of intermediate solutions arising in iterative algorithms based on a smoothed analysis of the QR factorization. These techniques may be of interest beyond their application here.

197 citations
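
As a rough illustration of the alternating-minimization idea described in the abstract above, the sketch below alternates least-squares updates of the two factors on the observed entries and re-orthonormalizes with a QR factorization after each sweep. It is a plain textbook variant, not the paper's algorithm (which relies on a noisy power-method analysis and a smoothed QR step). The function name `altmin_complete`, the spectral initialization, the matrix sizes, and the 50% sampling rate are all illustrative assumptions.

```python
import numpy as np

def altmin_complete(M_obs, mask, rank, iters=50):
    """Plain alternating least squares on the observed entries (mask == True).

    M_obs holds the observed values and zeros elsewhere (an assumption of this sketch).
    """
    n, m = M_obs.shape
    p = mask.mean()                                  # empirical sampling rate
    # Spectral initialization from the rescaled, zero-filled observations.
    U, _, _ = np.linalg.svd(M_obs / p, full_matrices=False)
    X = U[:, :rank]                                  # n x rank, orthonormal columns
    Y = np.zeros((m, rank))
    for _ in range(iters):
        # Fix X: one small least-squares problem per column of M.
        for j in range(m):
            rows = mask[:, j]
            Y[j] = np.linalg.lstsq(X[rows], M_obs[rows, j], rcond=None)[0]
        # Fix Y: one small least-squares problem per row of M.
        for i in range(n):
            cols = mask[i]
            X[i] = np.linalg.lstsq(Y[cols], M_obs[i, cols], rcond=None)[0]
        # Re-orthonormalize X by QR and push R into Y so X @ Y.T is unchanged.
        X, R = np.linalg.qr(X)
        Y = Y @ R.T
    return X, Y

rng = np.random.default_rng(0)
n, m, r = 60, 50, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
mask = rng.random((n, m)) < 0.5
X, Y = altmin_complete(M * mask, mask, rank=r)
rel_err = np.linalg.norm(X @ Y.T - M) / np.linalg.norm(M)
print("relative recovery error:", rel_err)
```

The QR step only changes the basis of the left factor; it keeps the iterate well conditioned without altering the current estimate `X @ Y.T`.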


Journal ArticleDOI
TL;DR: Experimental results show that the proposed blind image watermarking scheme has stronger robustness against most common attacks such as image compression, filtering, cropping, noise adding, blurring, scaling and sharpening etc.

157 citations


Journal ArticleDOI
TL;DR: A simple and efficient method for the denoising of very large datasets, based on the QR decomposition of a matrix randomly sampled from the data, that allows a gain of nearly three orders of magnitude in processing time compared with classical singular value decomposition denoising.
Abstract: Modern scientific research produces datasets of increasing size and complexity that require dedicated numerical methods to be processed. In many cases, the analysis of spectroscopic data involves the denoising of raw data before any further processing. Current efficient denoising algorithms require the singular value decomposition of a matrix with a size that scales up as the square of the data length, preventing their use on very large datasets. Taking advantage of recent progress on random projection and probabilistic algorithms, we developed a simple and efficient method for the denoising of very large datasets. Based on the QR decomposition of a matrix randomly sampled from the data, this approach allows a gain of nearly three orders of magnitude in processing time compared with classical singular value decomposition denoising. This procedure, called urQRd (uncoiled random QR denoising), strongly reduces the computer memory footprint and allows the denoising algorithm to be applied to virtually unlimited data size. The efficiency of these numerical tools is demonstrated on experimental data from high-resolution broadband Fourier transform ion cyclotron resonance mass spectrometry, which has applications in proteomics and metabolomics. We show that robust denoising is achieved in 2D spectra whose interpretation is severely impaired by scintillation noise. These denoising procedures can be adapted to many other data analysis domains where the size and/or the processing time are crucial.

68 citations
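
The abstract above describes denoising through the QR decomposition of a randomly sampled matrix. A minimal sketch of that general idea follows, assuming a Hankel embedding of a 1D signal, a Gaussian test matrix, and anti-diagonal averaging to return to a signal. These choices, the function name `random_qr_denoise`, and the test signal are assumptions for illustration and are not taken from the urQRd implementation.

```python
import numpy as np

def random_qr_denoise(x, rank, rng):
    """Project the Hankel matrix of x onto a randomly sampled `rank`-dimensional range."""
    L = len(x)
    m = L // 2
    n = L - m + 1
    # Hankel matrix: H[i, j] = x[i + j].
    idx = np.arange(m)[:, None] + np.arange(n)[None, :]
    H = x[idx]
    # Sample the column space with a random test matrix, then orthonormalize by QR.
    Omega = rng.standard_normal((n, rank))
    Q, _ = np.linalg.qr(H @ Omega)
    H_low = Q @ (Q.conj().T @ H)             # rank-limited approximation of H
    # Average the anti-diagonals to map the matrix back to a signal.
    out = np.zeros(L, dtype=H_low.dtype)
    np.add.at(out, idx, H_low)
    counts = np.bincount(idx.ravel(), minlength=L)
    return out / counts

rng = np.random.default_rng(0)
t = np.arange(400)
clean = np.cos(0.2 * t) + 0.5 * np.cos(0.5 * t)
noisy = clean + 0.1 * rng.standard_normal(t.size)
denoised = random_qr_denoise(noisy, rank=40, rng=rng)
print("error before:", np.linalg.norm(noisy - clean))
print("error after: ", np.linalg.norm(denoised - clean))
```

Only a tall matrix with `rank` columns is ever factorized, which is where the cost saving over a full SVD of `H` would come from in this sketch.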


Proceedings ArticleDOI
08 Dec 2014
TL;DR: This paper proposes a robust method to solve the absolute rotation estimation problem, which arises in global registration of 3D point sets and in structure-from-motion, by casting the problem as a "low-rank & sparse" matrix decomposition.
Abstract: This paper proposes a robust method to solve the absolute rotation estimation problem, which arises in global registration of 3D point sets and in structure-from-motion. A novel cost function is formulated which inherently copes with outliers. In particular, the proposed algorithm handles both outlier and missing relative rotations, by casting the problem as a "low-rank & sparse" matrix decomposition. As a side effect, this solution can be seen as a valid and cost-effective detector of inconsistent pairwise rotations. Computational efficiency and numerical accuracy are demonstrated by simulated and real experiments.

53 citations


Journal ArticleDOI
TL;DR: A novel blind image watermarking scheme based on QR decomposition is proposed to embed color watermark image into color host image, which is significantly different from using the binary or gray image as watermark.
Abstract: In this paper, a novel blind image watermarking scheme based on QR decomposition is proposed to embed color watermark image into color host image, which is significantly different from using the binary or gray image as watermark. When embedding watermark, the 24-bit color host image with size of 512 × 512 is divided into non-overlapping 4 × 4 pixel blocks and each pixel block is decomposed by QR. Then, according to the watermark information and the relation between the second row first column coefficient and the third row first column coefficient in the unitary matrix Q, the 24-bit color watermark image with size of 32 × 32 is embedded into the color host image. In addition, the new element compensatory method is used in the upper-triangle matrix R for reducing the visible distortion. When extracting watermark, only the watermarked image is needed. Compared with other SVD-based methods, the proposed method does not have the false-positive detection problem and has lower computational complexity, that is, the average running time of the proposed method only needs 1.481403 s. The experimental results show that the proposed method is robust against most common attacks including JPEG compression, JPEG 2000 compression, low-pass filtering, cropping, adding noise, blurring, rotation, scaling and sharpening, etc. Compared with some related existing methods, the proposed algorithm has stronger robustness and better invisibility.

50 citations
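
To make the embedding rule in the abstract above more concrete, here is a hedged sketch that stores one bit in the relation between the second-row and third-row entries of the first column of Q for a single 4 × 4 block. The specific rule (moving both magnitudes to their average plus or minus a margin), the `margin` value, and the function names are assumptions. The paper's compensation step on R, and the rounding of pixels back to 8-bit values, are not modeled.

```python
import numpy as np

def embed_bit(block, bit, margin=0.05):
    """Encode one bit as |Q[1,0]| > |Q[2,0]| (bit 1) or the reverse (bit 0)."""
    Q, R = np.linalg.qr(block.astype(float))
    sign = -1.0 if R[0, 0] < 0 else 1.0      # sign of Q's first column for positive pixels
    avg = (abs(Q[1, 0]) + abs(Q[2, 0])) / 2
    hi = avg + margin / 2
    lo = max(avg - margin / 2, 0.0)
    Q[1, 0], Q[2, 0] = (sign * hi, sign * lo) if bit else (sign * lo, sign * hi)
    return Q @ R                             # watermarked block (left unrounded here)

def extract_bit(block):
    """Blind extraction: only the watermarked block is needed."""
    Q, _ = np.linalg.qr(block.astype(float))
    return int(abs(Q[1, 0]) > abs(Q[2, 0]))

rng = np.random.default_rng(0)
block = rng.integers(1, 256, size=(4, 4))    # hypothetical 4 x 4 pixel block
recovered = [extract_bit(embed_bit(block, b)) for b in (0, 1)]
print(recovered)
```

Extraction works in this sketch because the first column of the modified block is a scalar multiple of the modified first column of Q, so a fresh QR factorization reproduces the same magnitude ordering.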


Proceedings ArticleDOI
23 Jun 2014
TL;DR: FT-ScaLAPACK is presented, a fault-tolerant version of ScaLAPACK that is able to detect, locate, and correct errors in Cholesky, QR, and LU factorizations on-line in the middle of the computation in a timely manner before the errors propagate and accumulate.
Abstract: It is well known that soft errors in linear algebra operations can be detected off-line at the end of the computation using algorithm-based fault tolerance (ABFT). However, traditional ABFT usually cannot correct errors in Cholesky, QR, and LU factorizations because any error in one matrix element will be propagated to many other matrix elements and hence cause too many errors to correct. Although, recently, tremendous progress has been made to correct errors in LU and QR factorizations, these new techniques correct errors off-line at the end of the computation after errors propagated and accumulated, which significantly complicates the error correction process and introduces at least quadratically increasing overhead as the number of errors increases. In this paper, we present the design and implementation of FT-ScaLAPACK, a fault-tolerant version of ScaLAPACK that is able to detect, locate, and correct errors in Cholesky, QR, and LU factorizations on-line in the middle of the computation in a timely manner before the errors propagate and accumulate. FT-ScaLAPACK has been validated with thousands of cores on Stampede at the Texas Advanced Computing Center. Experimental results demonstrate that FT-ScaLAPACK is able to achieve comparable performance and scalability with the original ScaLAPACK.

46 citations
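
The checksum idea behind algorithm-based fault tolerance mentioned above can be sketched for QR as follows: append the row sums of A as an extra column, factorize, and check that the extra column of R still equals the row sums of R. This sketch checks once at the end (the off-line style the abstract contrasts with), and only detects the error and locates its row. It is not FT-ScaLAPACK's on-line scheme, and the function names and injected error are illustrative assumptions.

```python
import numpy as np

def qr_with_checksum(A):
    """QR-factorize [A | A @ 1]; the last column of R then equals the row sums of R."""
    Ac = np.hstack([A, A.sum(axis=1, keepdims=True)])
    return np.linalg.qr(Ac)

def checksum_violation(Rc):
    """Return (relative size, row index) of the largest checksum violation."""
    R, c = Rc[:, :-1], Rc[:, -1]
    diff = np.abs(R.sum(axis=1) - c)
    return diff.max() / np.abs(Rc).max(), int(diff.argmax())

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
Q, Rc = qr_with_checksum(A)
clean_size, _ = checksum_violation(Rc)

Rc_bad = Rc.copy()
Rc_bad[1, 3] += 1e-3                         # simulated soft error in one entry of R
bad_size, bad_row = checksum_violation(Rc_bad)
print(clean_size, bad_size, bad_row)
```

The relation holds because orthogonal transformations act on every column alike, so a linear dependency among the columns of the input carries over to the columns of R.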


Posted Content
TL;DR: L-CCA is an iterative algorithm that can compute Canonical Correlation Analysis (CCA) quickly on huge sparse datasets, and it is shown to outperform other fast CCA approximation schemes on two real datasets.
Abstract: Canonical Correlation Analysis (CCA) is a widely used statistical tool with both well established theory and favorable performance for a wide range of machine learning problems. However, computing CCA for huge datasets can be very slow since it involves implementing QR decomposition or singular value decomposition of huge matrices. In this paper we introduce L-CCA, an iterative algorithm which can compute CCA fast on huge sparse datasets. Theory on both the asymptotic convergence and finite time accuracy of L-CCA is established. The experiments also show that L-CCA outperforms other fast CCA approximation schemes on two real datasets.

44 citations
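
For context on why CCA "involves implementing QR decomposition or singular value decomposition", the classical dense approach (due to Björck and Golub) can be sketched as below. This is the exact baseline that becomes slow on huge data, not L-CCA itself. The function name `cca_qr`, the synthetic data, and the full-column-rank assumption are illustrative.

```python
import numpy as np

def cca_qr(X, Y):
    """Canonical correlations and weights via QR of each view plus one small SVD.

    Assumes both centered data matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    A = np.linalg.solve(Rx, U)               # canonical weights for X
    B = np.linalg.solve(Ry, Vt.T)            # canonical weights for Y
    return s, A, B

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
Y = X[:, :2] @ rng.standard_normal((2, 3)) + 0.5 * rng.standard_normal((500, 3))
s, A, B = cca_qr(X, Y)
print("canonical correlations:", s)
```

The singular values of `Qx.T @ Qy` are the canonical correlations; the two QR factorizations of the full data matrices are the expensive part on large inputs.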


Journal ArticleDOI
TL;DR: It is shown that the QR decomposition offers capacity and robustness comparable to or better than similar watermarking based on the DCT and SVD transformations, and an interesting property of QR decomposition is proved and put into practice in more detail.
Abstract: A blind watermarking technique based on QR decomposition is proposed for still images. The method is presented in spatial as well as transform domains and its robustness against some well-known image processing attacks is evaluated. It is shown that the QR decomposition offers capacity and robustness comparable to or better than similar watermarking based on the DCT and SVD transformations. Also, an interesting property of QR decomposition, which was first introduced in our previous paper, is proved and put into practice in more detail.

39 citations


Posted Content
TL;DR: In this article, the authors describe efficient algorithms for the computation of the CUR and ID decompositions, which are based on simple modifications to the classical truncated pivoted QR decomposition.
Abstract: The manuscript describes efficient algorithms for the computation of the CUR and ID decompositions. The methods used are based on simple modifications to the classical truncated pivoted QR decomposition, which means that highly optimized library codes can be utilized for implementation. For certain applications, further acceleration can be attained by incorporating techniques based on randomized projections. Numerical experiments demonstrate advantageous performance compared to existing techniques for computing CUR factorizations.

38 citations


Proceedings ArticleDOI
19 May 2014
TL;DR: The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with Communication-Avoiding QR (CAQR) algorithm.
Abstract: The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system. We also provide algorithmic improvements to the ScaLAPACK and CAQR algorithms.

37 citations
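
The TSQR reduction described above can be sketched in a few lines: factorize row blocks independently, stack the small R factors, and factorize once more. This one-level, serial version only returns R and is meant as an illustration; the block count and function name are assumptions, and the paper's Householder reconstruction step is not shown.

```python
import numpy as np

def tsqr_r(A, n_blocks=4):
    """R factor of a tall-skinny A via one level of blockwise QR reduction."""
    blocks = np.array_split(A, n_blocks, axis=0)
    # Local step: each row block (one per processor in a parallel setting) gets its own QR.
    local_R = [np.linalg.qr(B, mode="r") for B in blocks]
    # Reduction step: stack the small R factors and factorize them once more.
    return np.linalg.qr(np.vstack(local_R), mode="r")

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))
R_tsqr = tsqr_r(A)
R_ref = np.linalg.qr(A, mode="r")
# R is unique only up to the sign of each row, so compare magnitudes.
print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))
```

In this form the orthogonal factor exists only implicitly as a product of the per-block factors, which is the "different representation" the abstract refers to.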


Journal ArticleDOI
TL;DR: The effectiveness of SGSA is demonstrated in reconstructing and predicting two noisy benchmark nonlinear dynamic systems, the Lorenz and Mackey-Glass attractors, and its applicability to real-life applications is shown on sunspot-number and mechanomyographic time series.
Abstract: Various time-series decomposition techniques, including wavelet transform, singular spectrum analysis, empirical mode decomposition and independent component analysis, have been developed for non-linear dynamic system analysis. In this paper, we describe a symplectic geometry spectrum analysis (SGSA) method to decompose a time series into a set of independent additive components. SGSA is performed in four steps: embedding, symplectic QR decomposition, grouping and diagonal averaging. The obtained components can be used for de-noising, prediction, control and synchronization. We demonstrate the effectiveness of SGSA in reconstructing and predicting two noisy benchmark nonlinear dynamic systems: the Lorenz and Mackey-Glass attractors. Examples of prediction of a decadal average sunspot number time series and a mechanomyographic signal recorded from human skeletal muscle further demonstrate the applicability of the SGSA method in real-life applications.

Journal ArticleDOI
TL;DR: The insight from, and conclusions of this paper motivate efficient and numerically robust ‘new’ variants of algorithms for solving the single response partial least squares regression (PLS1) problem.
Abstract: The insight from, and conclusions of this paper motivate efficient and numerically robust ‘new’ variants of algorithms for solving the single response partial least squares regression (PLS1) problem. Prototype MATLAB code for these variants is included in the Appendix. The analysis of and conclusions regarding PLS1 modelling are based on a rich and nontrivial application of numerous key concepts from elementary linear algebra. The investigation starts with a simple analysis of the nonlinear iterative partial least squares (NIPALS) PLS1 algorithm variant computing orthonormal scores and weights. A rigorous interpretation of the squared P-loadings as the variable-wise explained sum of squares is presented. We show that the orthonormal row-subspace basis of W-weights can be found from a recurrence equation. Consequently, the NIPALS deflation steps of the centered predictor matrix can be replaced by a corresponding sequence of Gram–Schmidt steps that compute the orthonormal column-subspace basis of T-scores from the associated non-orthogonal scores. The transitions between the non-orthogonal and orthonormal scores and weights (illustrated by an easy-to-grasp commutative diagram), respectively, are both given by QR factorizations of the non-orthogonal matrices. The properties of singular value decomposition combined with the mappings between the alternative representations of the PLS1 ‘truncated’ X data (including PtW) are taken to justify an invariance principle to distinguish between the PLS1 truncation alternatives. The fundamental orthogonal truncation of PLS1 is illustrated by a Lanczos bidiagonalization type of algorithm where the predictor matrix deflation is required to be different from the standard NIPALS deflation. A mathematical argument concluding the PLS1 inconsistency debate (published in 2009 in this journal) is also presented. Copyright © 2014 John Wiley & Sons, Ltd.

Proceedings ArticleDOI
16 Nov 2014
TL;DR: Although the communication cost of CholeskyQR2 is twice that of TSQR, it has an advantage that its reduction operation is addition whereas that of TSQR is a QR factorization, whose high-performance implementation is more difficult.
Abstract: Designing communication-avoiding algorithms is crucial for high performance computing on a large-scale parallel system. The TSQR algorithm is a communication-avoiding algorithm for computing a tall-skinny QR factorization, and TSQR is known to be much faster and as stable as the classical Householder QR algorithm. The Cholesky QR algorithm is another very simple and fast communication-avoiding algorithm, but rarely used in practice because of its numerical instability. Our recent work points out that an algorithm that simply repeats Cholesky QR twice, which we call CholeskyQR2, gives excellent accuracy for a wide range of matrices arising in practice. Although the communication cost of CholeskyQR2 is twice that of TSQR, it has an advantage that its reduction operation is addition whereas that of TSQR is a QR factorization, whose high-performance implementation is more difficult. Thus, CholeskyQR2 can potentially be significantly faster than TSQR. Indeed, in our experiments using 16384 nodes of the K computer, CholeskyQR2 ran about three times faster than TSQR for a 4194304 × 64 matrix.
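
The "repeat Cholesky QR twice" idea from the abstract above is short enough to sketch directly. The serial NumPy version below is an illustration under the assumption that the Gram matrix stays numerically positive definite; the test matrix with condition number around 1e5 and the function names are illustrative choices.

```python
import numpy as np

def cholesky_qr(A):
    """One Cholesky QR pass: A = Q R with R from the Cholesky factor of A.T @ A."""
    G = A.T @ A                              # Gram matrix: the only reduction is a sum
    R = np.linalg.cholesky(G).T              # upper triangular, G = R.T @ R
    Q = np.linalg.solve(R.T, A.T).T          # Q = A @ inv(R)
    return Q, R

def cholesky_qr2(A):
    """Two passes: the second pass repairs the orthogonality lost in the first."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((200, 6)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = U @ np.diag(np.logspace(0, -5, 6)) @ V.T     # condition number about 1e5
Q1, _ = cholesky_qr(A)
Q2, R2 = cholesky_qr2(A)
err1 = np.linalg.norm(Q1.T @ Q1 - np.eye(6))
err2 = np.linalg.norm(Q2.T @ Q2 - np.eye(6))
print("orthogonality error, one pass:", err1, "two passes:", err2)
```

In a distributed setting each node would form its local part of `A.T @ A` and the parts would be summed, which is the addition-only reduction that the abstract contrasts with TSQR.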

Journal ArticleDOI
TL;DR: This paper presents the first VLSI implementation of a complete LRA K-best detector with an 8 × 8 dimension and the corresponding energy per bit is 63 pJ/bit, which is the smallest value achieved to date.
Abstract: This paper presents the VLSI implementation of a lattice-reduction-aided (LRA) detection system. The proposed system includes a QR decomposition, lattice reduction (LR) processor, and sorting-reduced (SR) K-best detector for 8 × 8 multiple-input multiple-output (MIMO) systems. The bit error rate of the proposed MIMO detection system only incurs approximately 3 dB of implementation loss compared with optimal maximum likelihood detection with 64-quadrature amplitude modulation. The proposed processor can also support different throughput requirements by adjusting the stage number of LR. The SR K-best detector can achieve 3.1 Gb/s throughput with 0.24-ns latency. The throughput of the system reaches 585 Mb/s if one channel preprocessing can support 72 symbol detections. The corresponding energy per bit is 63 pJ/bit, which is the smallest value achieved to date. This paper presents the first VLSI implementation of a complete LRA K-best detector with an 8 × 8 dimension.

Proceedings ArticleDOI
04 May 2014
TL;DR: Algorithms for Tucker tensor decomposition are proposed, which can avoid computing singular value decomposition or eigenvalue decomposition of large matrices as in the work-horse higher order orthogonal iteration (HOOI) algorithm.
Abstract: We propose algorithms for Tucker tensor decomposition, which can avoid computing singular value decomposition or eigenvalue decomposition of large matrices as in the work-horse higher order orthogonal iteration (HOOI) algorithm. The novel algorithms require computational cost of $O(I^3 R)$, which is cheaper than $O(I^3 R + I R^4 + R^6)$ of HOOI for multilinear rank-(R, R, R) tensors of size I × I × I.

Journal ArticleDOI
TL;DR: The proposed algorithm performs SQRD through orthogonalizations based on the modified Gram-Schmidt process, rearranging the column vectors of a real-valued MIMO channel matrix in such a way that the symmetry between the vectors is maintained.
Abstract: QR decomposition (QRD) is a preprocessing technique for detecting symbols in multiple-input and multiple-output (MIMO) systems, but the computational complexity is prohibitively high when the systems incorporate a large number of antennas. This paper presents a low-complexity sorted QRD (SQRD) algorithm for MIMO systems. The proposed algorithm performs SQRD through orthogonalizations based on the modified Gram-Schmidt process, rearranging the column vectors of a real-valued MIMO channel matrix in such a way that the symmetry between the vectors is maintained. By using the symmetry, the computations required for orthogonalizing one of the two adjacent vectors can be eliminated effectively, which significantly reduces the computational complexity. Theoretical analyses show that the proposed algorithm reduces the computational complexity required for SQRD by 50% for any MIMO configuration, when compared to the conventional algorithm. In addition, the memory requirement to store resultant matrices is 50% of that in the conventional one.
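
For reference, the conventional sorted QRD that the abstract above uses as its baseline can be sketched with modified Gram-Schmidt and a minimum-norm column choice at every step. This is the standard baseline only. The symmetry-exploiting reduction proposed in the paper is not implemented, and the function name and the random test matrix are assumptions.

```python
import numpy as np

def sorted_qrd(H):
    """Sorted QR via modified Gram-Schmidt: returns Q, R, perm with H[:, perm] = Q @ R."""
    Q = H.astype(complex if np.iscomplexobj(H) else float)   # working copy
    n = Q.shape[1]
    R = np.zeros((n, n), dtype=Q.dtype)
    perm = np.arange(n)
    for i in range(n):
        # Pick the remaining column with the smallest residual norm.
        norms = np.sum(np.abs(Q[:, i:]) ** 2, axis=0)
        k = i + int(np.argmin(norms))
        Q[:, [i, k]] = Q[:, [k, i]]
        R[:, [i, k]] = R[:, [k, i]]
        perm[[i, k]] = perm[[k, i]]
        # Normalize the chosen column, then remove it from the remaining ones.
        R[i, i] = np.linalg.norm(Q[:, i])
        Q[:, i] /= R[i, i]
        R[i, i + 1:] = Q[:, i].conj() @ Q[:, i + 1:]
        Q[:, i + 1:] -= np.outer(Q[:, i], R[i, i + 1:])
    return Q, R, perm

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 4))              # hypothetical real-valued channel matrix
Q, R, perm = sorted_qrd(H)
print(perm, np.allclose(H[:, perm], Q @ R))
```

Choosing the weakest remaining column first tends to leave larger diagonal entries of R for the layers detected first, which is the usual motivation for sorting in MIMO detection.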

Journal ArticleDOI
TL;DR: The block diagonal Jacket matrix decomposition is proposed, which is able not only to extend the conventional block diagonal channel decomposition but also to achieve the MIMO broadcast channel capacity.
Abstract: The block diagonalization (BD) is a linear precoding technique for multi-user multi-input multi-output (MIMO) broadcast channels, which is able to completely eliminate the multi-user interference (MUI), but it is not computationally efficient. In this paper, we propose the block diagonal Jacket matrix decomposition, which is able not only to extend the conventional block diagonal channel decomposition but also to achieve the MIMO broadcast channel capacity. We also prove that the QR algorithm achieves the same sum rate as that of the conventional BD scheme. The complexity analysis shows that our proposal is more efficient than the conventional BD method in terms of the number of required computations.

Journal ArticleDOI
TL;DR: In this paper, a thermodynamically consistent finite deformation continuum model is developed to simulate the thermomechanical response of shape memory polymers (SMPs) under simple shear deformation.

Proceedings Article
08 Dec 2014
TL;DR: L-CCA is an iterative algorithm that can quickly compute, on huge sparse datasets, Canonical Correlation Analysis (CCA), a widely used statistical tool with both established theory and favorable performance for a wide range of machine learning problems.
Abstract: Canonical Correlation Analysis (CCA) is a widely used statistical tool with both well established theory and favorable performance for a wide range of machine learning problems. However, computing CCA for huge datasets can be very slow since it involves implementing QR decomposition or singular value decomposition of huge matrices. In this paper we introduce L-CCA, an iterative algorithm which can compute CCA fast on huge sparse datasets. Theory on both the asymptotic convergence and finite time accuracy of L-CCA is established. The experiments also show that L-CCA outperforms other fast CCA approximation schemes on two real datasets.

Journal ArticleDOI
TL;DR: A Rank Adaptive Atomic Decomposition for Low-Rank Matrix Completion (RAADLRMC) algorithm is proposed, based on Atomic Decomposition for Minimum Rank Approximation; it is robust, and the rank of the matrix can be predicted accurately.

Journal ArticleDOI
TL;DR: In this paper, a sensor placement method is proposed for better prediction of the dynamic response reconstruction, which is based on the transmissibility matrix between two sets of sensor locations, and a two-step sensor placement algorithm is proposed.

Journal ArticleDOI
TL;DR: An efficient method based on group theory for the Moore-Penrose inverse problems for symmetric structures is proposed, which can deal with not only well-conditioned but also rank deficient matrices.
Abstract: The Moore-Penrose inverse has many applications in civil engineering, such as structural control, nonlinear buckling, and form-finding. However, solving the generalized inverse requires ample computational resources, especially for large-sized matrices. An efficient method based on group theory for the Moore-Penrose inverse problems for symmetric structures is proposed, which can deal with not only well-conditioned but also rank deficient matrices. First, the QR decomposition algorithm is chosen to evaluate the generalized inverse of any sparse and rank deficient matrix. In comparison with other well established algorithms, the QR method has superiority in computation efficiency and accuracy. Then, a group-theoretic approach to computing the Moore-Penrose inverse for problems involving symmetric structures is described. Based on the inherent symmetry and the irreducible representations, the orthogonal transformation matrices are deduced to express the inverse problem in a symmetry-adapted coordinate system. The original problem is transferred into computing the generalized inverse of many independent submatrices. Numerical experiments on three different types of structures with cyclic or dihedral symmetry are carried out. It is concluded from the numerical results and comparisons with two conventional methods that the proposed technique is efficient and accurate. DOI: 10.1061/(ASCE)CP.1943-5487.0000266. © 2014 American Society of Civil Engineers.
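
As a small illustration of evaluating a generalized inverse through QR, the sketch below covers only the simplest case, a matrix with full column rank, where the Moore-Penrose inverse equals inv(R) @ Q.T. The rank-deficient and sparse cases treated in the paper need a rank-revealing (pivoted) factorization and are not handled here; the function name is an assumption.

```python
import numpy as np

def pinv_qr_full_rank(A):
    """Moore-Penrose inverse of a full-column-rank matrix via reduced QR."""
    Q, R = np.linalg.qr(A)                   # A = Q R, R square and nonsingular
    return np.linalg.solve(R, Q.T)           # inv(R) @ Q.T without forming inv(R)

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 4))
A_pinv = pinv_qr_full_rank(A)
print(np.allclose(A_pinv, np.linalg.pinv(A)))
```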

Patent
31 Dec 2014
TL;DR: A strong maneuver-based target tracking method is described that comprises the following steps: initializing parameters, interactively inputting models, judging the covariance matrix, filtering in parallel, updating the model probabilities, interactively outputting the models, and filtering in a fixed-delay smoothing manner.
Abstract: A strong maneuver-based target tracking method comprises the following steps: initializing parameters, interactively inputting models, judging the covariance matrix, filtering in parallel, updating the model probabilities, interactively outputting the models, filtering in a fixed-delay smoothing manner, and judging whether the state updating is completed; on the basis of the IMM algorithm, the IMM algorithm for recalculating the weight is used, that is, RIMM. The method not only uses the model probabilities, but also takes full advantage of the filtering covariance matrix, so that the tracking accuracy is higher. In addition, an SRCKF method is used in the filter forecasting stage, and adopts the spherical integrating criterion and the radial integrating criterion. Compared with the widely used UKF algorithm in nonlinear filtering, the target tracking method optimizes the sigma point sampling strategy and the weight distribution in the UKF. Meanwhile, QR factorization is introduced in SRCKF, so that the matrix square-root operation is avoided, and the filtering stability is improved. On this basis, the filtering in a fixed-delay smoothing manner is also introduced, so that the real-time property and the accuracy of target tracking are further improved.
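
The role of QR factorization in square-root filters can be sketched for the prediction step: the square root of the predicted covariance is read off the R factor of a stacked matrix, so the covariance itself is never formed or re-factorized. The sketch assumes a linear transition matrix `F` for brevity, whereas a cubature filter would stack weighted deviations of propagated cubature points instead. All names and the random test data are illustrative.

```python
import numpy as np

def sqrt_cov_predict(S, F, Sq):
    """Return lower-triangular S_pred with S_pred @ S_pred.T = F P F.T + Qn,

    where P = S @ S.T is the current covariance and Qn = Sq @ Sq.T the process noise.
    """
    stacked = np.hstack([F @ S, Sq]).T       # (2n) x n
    R = np.linalg.qr(stacked, mode="r")      # n x n upper triangular
    return R.T

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)                  # hypothetical state covariance
Qn = 0.1 * np.eye(n)                         # hypothetical process-noise covariance
F = rng.standard_normal((n, n))
S_pred = sqrt_cov_predict(np.linalg.cholesky(P), F, np.linalg.cholesky(Qn))
print(np.allclose(S_pred @ S_pred.T, F @ P @ F.T + Qn))
```

Working with the triangular factor keeps the implied covariance symmetric and positive semidefinite by construction, which is the stability benefit usually attributed to square-root filtering.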

01 Jan 2014
TL;DR: This thesis targets the design of parallelizable algorithms and communication-efficient parallel schedules for numerical linear algebra as well as computations with higher-order tensors, and introduces Cyclops Tensor Framework, which provides an automated mechanism for network-topology-aware decomposition and redistribution of tensor data.
Abstract: Author(s): Solomonik, Edgar | Advisor(s): Demmel, James | Abstract: This thesis targets the design of parallelizable algorithms and communication-efficient parallel schedules for numerical linear algebra as well as computations with higher-order tensors. Communication is a growing bottleneck in the execution of most algorithms on parallel computers, which manifests itself as data movement both through the network connecting different processors and through the memory hierarchy of each processor as well as synchronization between processors. We provide a rigorous theoretical model of communication and derive lower bounds as well as algorithms in this model. Our analysis concerns two broad areas of linear algebra and of tensor contractions. We demonstrate the practical quality of the new theoretically-improved algorithms by presenting results which show that our implementations outperform standard libraries and traditional algorithms. We model the costs associated with local computation, interprocessor communication and synchronization, as well as memory to cache data transfers of a parallel schedule based on the most expensive execution path in the schedule. We introduce a new technique for deriving lower bounds on tradeoffs between these costs and apply them to algorithms in both dense and sparse linear algebra as well as graph algorithms. These lower bounds are attained by what we refer to as 2.5D algorithms, which we give for matrix multiplication, Gaussian elimination, QR factorization, the symmetric eigenvalue problem, and the Floyd-Warshall all-pairs shortest-paths algorithm. 2.5D algorithms achieve lower interprocessor bandwidth cost by exploiting auxiliary memory. Algorithms employing this technique are well known for matrix multiplication, and have been derived in the BSP model for LU and QR factorization, as well as the Floyd-Warshall algorithm. 
We introduce alternate versions of LU and QR algorithms which have measurable performance improvements over their BSP counterparts, and we give the first evaluations of their performance. We also explore network-topology-aware mapping on torus networks for matrix multiplication and LU, showing how 2.5D algorithms can efficiently exploit collective communication, as well as introducing an adaptation of Cannon's matrix multiplication algorithm that is better suited for torus networks with dimension larger than two. For the symmetric eigenvalue problem, we give the first 2.5D algorithms, additionally solving challenges with memory-bandwidth efficiency that arise for this problem. We also give a new memory-bandwidth efficient algorithm for Krylov subspace methods (repeated multiplication of a vector by a sparse-matrix), which is motivated by the application of our lower bound techniques to this problem. The latter half of the thesis contains algorithms for higher-order tensors, in particular tensor contractions. The motivating application for this work is the family of coupled-cluster methods, which solve the many-body Schrodinger equation to provide a chemically-accurate model of the electronic structure of molecules and chemical reactions where electron correlation plays a significant role. The numerical computation of these methods is dominated in cost by contraction of antisymmetric tensors. We introduce Cyclops Tensor Framework, which provides an automated mechanism for network-topology-aware decomposition and redistribution of tensor data. It leverages 2.5D matrix multiplication to perform tensor contractions communication-efficiently. The framework is capable of exploiting symmetry and antisymmetry in tensors and utilizes a distributed packed-symmetric storage format. 
Finally, we consider a theoretically novel technique for exploiting tensor symmetry to lower the number of multiplications necessary to perform a contraction via computing some redundant terms that allow preservation of symmetry and then cancelling them out with low-order cost. We analyze the numerical stability and communication efficiency of this technique and give adaptations to antisymmetric and Hermitian matrices. This technique has promising potential for accelerating coupled-cluster methods both in terms of computation and communication cost, and additionally provides a potential improvement for BLAS routines on complex matrices.

Journal ArticleDOI
01 Jul 2014
TL;DR: This work investigates the viability of implementing QR updating algorithms on GPUs and demonstrates that GPU-based updating for removing columns achieves speed-ups of up to 13.5x compared with full GPU QR factorization.
Abstract: Linear least squares problems are commonly solved by QR factorization. When multiple solutions need to be computed with only minor changes in the underlying data, knowledge of the difference between the old data set and the new can be used to update an existing factorization at reduced computational cost. We investigate the viability of implementing QR updating algorithms on GPUs and demonstrate that GPU-based updating for removing columns achieves speed-ups of up to 13.5x compared with full GPU QR factorization. We characterize the conditions under which other types of updates also achieve speed-ups.
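
The kind of update the abstract above refers to can be illustrated for column removal: deleting a column of R leaves an upper Hessenberg matrix, and a short sequence of Givens rotations restores the triangular form. The serial NumPy sketch below uses the full (square Q) factorization and is not the GPU implementation studied in the paper; the function name and test sizes are assumptions.

```python
import numpy as np

def qr_delete_column(Q, R, k):
    """Update a full QR (Q is m x m, R is m x n) after deleting column k of A."""
    R = np.delete(R, k, axis=1)              # upper Hessenberg from column k onward
    Q = Q.copy()
    m, n = R.shape
    for j in range(k, min(n, m - 1)):
        a, b = R[j, j], R[j + 1, j]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        G = np.array([[c, s], [-s, c]])      # rotation that zeroes R[j + 1, j]
        R[j:j + 2, :] = G @ R[j:j + 2, :]
        Q[:, j:j + 2] = Q[:, j:j + 2] @ G.T  # keep the product Q @ R unchanged
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, R = np.linalg.qr(A, mode="complete")
Q2, R2 = qr_delete_column(Q, R, k=1)
A_del = np.delete(A, 1, axis=1)
print(np.allclose(Q2 @ R2, A_del))
```

Each rotation touches only two rows of R and two columns of Q, which is why an update can be much cheaper than refactorizing when only a few columns change.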

Posted Content
TL;DR: This paper considers the stability of the QR factorization in an oblique inner product and analyzes two algorithms that are based on a factorization of A and on converting the problem to the Euclidean case, using the Cholesky decomposition and the eigenvalue decomposition.
Abstract: In this paper we consider the stability of the QR factorization in an oblique inner product. The oblique inner product is defined by a symmetric positive definite matrix A. We analyze two algorithms that are based on a factorization of A and on converting the problem to the Euclidean case. The two algorithms we consider use the Cholesky decomposition and the eigenvalue decomposition. We also analyze algorithms that are based on computing the Cholesky factor of the normal equation. We present numerical experiments to show that the error bounds are tight. Finally, we present performance results for these algorithms, as well as for Gram-Schmidt methods, on parallel architectures. The performance experiments demonstrate the benefit of the communication-avoiding algorithms.
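A minimal sketch of the Cholesky-based conversion to the Euclidean case, assuming the matrix to be factored is called `X` and that the oblique inner product is ⟨u, v⟩ = uᵀAv. The function name `oblique_qr_cholesky` is hypothetical, and this is one plausible reading of the approach rather than the paper's exact algorithm.

```python
import numpy as np

def oblique_qr_cholesky(A, X):
    """Factor X = Q R with Q^T A Q = I, for symmetric positive definite A."""
    L = np.linalg.cholesky(A)          # A = L L^T
    Qe, R = np.linalg.qr(L.T @ X)      # Euclidean QR of the transformed matrix
    Q = np.linalg.solve(L.T, Qe)       # map back: Q = L^{-T} Qe
    return Q, R

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)            # assumed SPD test matrix
X = rng.standard_normal((6, 3))

Q, R = oblique_qr_cholesky(A, X)
```

Here `Q` is orthonormal with respect to `A` rather than in the Euclidean sense, and `R` is upper triangular.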

Proceedings ArticleDOI
28 Aug 2014
TL;DR: This paper introduces the doubly-restricted contraction constant (DRCC), a characteristic of a matrix, which predicts the feasibility of matrix recovery from a subset of its entries and establishes results regarding the convergence rate of the algorithm using the DRCC.
Abstract: The matrix completion problem addresses the recovery of a low-rank matrix from a subset of its entries. In this paper, we analyze a rank-r matrix completion algorithm based on the rank-r singular value decomposition (SVD). We introduce the doubly-restricted contraction constant (DRCC), a characteristic of a matrix, which predicts the feasibility of matrix recovery from a subset of its entries. We establish results regarding the convergence rate of the algorithm using the DRCC. Numerical experiments indicate that the DRCC accurately predicts the recovery of a matrix from a subset of its entries.
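One common completion scheme built on the rank-r SVD alternates between truncating to rank r and re-imposing the observed entries. The sketch below assumes that scheme; the abstract does not spell out the paper's exact iteration, and the name `complete_rank_r`, the iteration count, and the test data are all illustrative.

```python
import numpy as np

def complete_rank_r(M_obs, mask, r, iters=200):
    """Iterate: truncate to rank r via the SVD, then restore observed entries."""
    X = np.where(mask, M_obs, 0.0)              # start with zeros in the gaps
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_low = (U[:, :r] * s[:r]) @ Vt[:r]     # best rank-r approximation of X
        X = np.where(mask, M_obs, X_low)        # keep known entries fixed
    return X_low

rng = np.random.default_rng(0)
r = 2
M = rng.standard_normal((30, r)) @ rng.standard_normal((r, 30))  # rank-2 target
mask = rng.random(M.shape) < 0.7                                  # observed set
M_hat = complete_rank_r(M, mask, r)
```

Convergence is not guaranteed for every sampling pattern, which is the kind of question a quantity such as the DRCC is meant to address.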

Journal ArticleDOI
TL;DR: A new block-wise complex Givens rotation (BCGR) based algorithm and a triangular systolic array (TSA) to compute the QRD of the equivalent channel matrix in an Alamouti block by block manner, which can compute QRDs of 4-by-4 equivalent channel matrices faster than any architecture that has been developed for the SDM MIMO system.
Abstract: Unlike the channel matrix in the spatial division multiplexing (SDM) multiple-input multiple-output (MIMO) communication system, the equivalent channel matrix in the layered Alamouti space-time block coding (STBC) MIMO system comprises 2-by-2 Alamouti sub-blocks. One novel property of the QR-decomposition (QRD) of this equivalent channel matrix, found by Sayed, is that the produced Q- and R-matrices are also matrices with Alamouti sub-blocks. Taking advantage of this property, we propose a new block-wise complex Givens rotation (BCGR) based algorithm and a triangular systolic array (TSA) to compute the QRD of the equivalent channel matrix in an Alamouti block-by-block manner. Implementation results reveal that our new TSA can compute QRDs of 4-by-4 equivalent channel matrices faster than any architecture that has been developed for the SDM MIMO system. This property of fast QRD makes our TSA very attractive for the layered Alamouti STBC MIMO system combined with orthogonal frequency division multiplexing. Our new BCGR based approach can also be applied to the hybrid Alamouti STBC MIMO system, which is also a system with an equivalent channel matrix consisting of Alamouti sub-blocks.
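The structural property can be checked numerically. The sketch below assumes the block convention [[a, b], [-conj(b), conj(a)]] for an Alamouti sub-block and normalizes R to have a positive real diagonal; it uses NumPy's general-purpose QR, not the BCGR algorithm or systolic array proposed in the paper, and all helper names are hypothetical.

```python
import numpy as np

def alamouti_block(a, b):
    # Assumed convention for a 2-by-2 Alamouti sub-block.
    return np.array([[a, b], [-np.conj(b), np.conj(a)]])

def has_alamouti_blocks(M, tol=1e-10):
    # True if every 2-by-2 sub-block of M follows the convention above.
    for i in range(0, M.shape[0], 2):
        for j in range(0, M.shape[1], 2):
            B = M[i:i + 2, j:j + 2]
            if not np.allclose(B, alamouti_block(B[0, 0], B[0, 1]), atol=tol):
                return False
    return True

rng = np.random.default_rng(1)
n = 2  # number of Alamouti blocks per side, giving a 4-by-4 matrix
coeffs = rng.standard_normal((n, n, 2)) + 1j * rng.standard_normal((n, n, 2))
H = np.block([[alamouti_block(*coeffs[i, j]) for j in range(n)]
              for i in range(n)])

Q, R = np.linalg.qr(H)
# Rescale so that diag(R) is real and positive, which makes the QR unique.
d = np.diag(R) / np.abs(np.diag(R))
Q, R = Q * d, np.conj(d)[:, None] * R
```

With this normalization both factors keep the sub-block structure, so only half of the entries in each block need to be computed explicitly.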

Journal ArticleDOI
TL;DR: This work shows how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor.
Abstract: We show how both the tridiagonal and bidiagonal QR algorithms can be restructured so that they become rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple sets of Givens rotations to the eigenvector/singular vector matrix. This algorithm is then implemented with optimizations that: (1) leverage vector instruction units to increase floating-point throughput, and (2) fuse multiple rotations to decrease the total number of memory operations. We demonstrate the merits of these new QR algorithms for computing the Hermitian eigenvalue decomposition (EVD) and singular value decomposition (SVD) of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to traditional QR algorithms for these problems and is competitive with two commonly used alternatives (Cuppen’s Divide-and-Conquer algorithm and the method of Multiple Relatively Robust Representations) while inheriting the more modest O(n) workspace requirements of the original QR algorithms. Since the computations performed by the restructured algorithms remain essentially identical to those performed by the original methods, robust numerical properties are preserved.
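For orientation, the basic operation being restructured is the application of one set of Givens rotations to adjacent column pairs of the eigenvector matrix. The sketch below is the naive, unfused form with an assumed sign convention; it is not the cache-friendly or fused variant the paper develops, and `apply_givens_sequence` is a hypothetical name.

```python
import numpy as np

def apply_givens_sequence(V, c, s):
    """Apply rotation i (cosine c[i], sine s[i]) to columns i and i+1 of V."""
    V = V.copy()
    for i in range(len(c)):
        vi = V[:, i].copy()
        vj = V[:, i + 1].copy()
        V[:, i] = c[i] * vi + s[i] * vj
        V[:, i + 1] = -s[i] * vi + c[i] * vj
    return V

rng = np.random.default_rng(0)
n = 6
V0 = np.eye(n)                          # start from orthonormal columns
theta = rng.uniform(0, 2 * np.pi, n - 1)
V1 = apply_givens_sequence(V0, np.cos(theta), np.sin(theta))
```

Each rotation touches two full columns, so applying many sets one after another streams the whole matrix through memory repeatedly; that memory traffic is what fusing rotations aims to reduce.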

Journal ArticleDOI
01 Jul 2014
TL;DR: This paper presents a new stable algorithm for the parallel QR-decomposition of "tall and skinny" matrices based on the fast but unstable CholeskyQR algorithm and provides promising results of the MPI-based implementation on a BlueGene/P and a Power6 system.
Abstract: In this paper we present a new stable algorithm for the parallel QR-decomposition of "tall and skinny" matrices. The algorithm has been developed for the dense symmetric eigensolver ELPA, where the QR-decomposition of tall and skinny matrices represents an important substep. Our new approach is based on the fast but unstable CholeskyQR algorithm (Stathopoulos and Wu, 2002) [1]. We show the stability of our new algorithm and provide promising results of our MPI-based implementation on a BlueGene/P and a Power6 system.
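As background, the basic CholeskyQR step that the paper builds on can be sketched as follows. This is the plain, unstabilized variant on a single node; the stabilization and the MPI parallelization described in the paper are not reproduced here, and the function name `cholesky_qr` is an assumption.

```python
import numpy as np

def cholesky_qr(A):
    """Plain CholeskyQR: R from the Cholesky factor of A^T A, then Q = A R^{-1}."""
    G = A.T @ A                         # small n-by-n Gram matrix
    R = np.linalg.cholesky(G).T         # G = R^T R with R upper triangular
    Q = np.linalg.solve(R.T, A.T).T     # Q = A R^{-1} without forming the inverse
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))      # assumed well-conditioned tall matrix
Q, R = cholesky_qr(A)
```

Forming the Gram matrix squares the condition number of `A`, which is why the orthogonality of `Q` degrades for ill-conditioned inputs and why a stabilized variant is needed.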