
Showing papers on "QR decomposition" published in 2010


Journal ArticleDOI
TL;DR: An approximation algorithm for finding optimal decompositions, based on the insight provided by the theorem, that significantly outperforms a greedy approximation algorithm for a set covering problem to which the problem of matrix decomposition is easily shown to be reducible.

254 citations


Journal ArticleDOI
TL;DR: A least-squares solver for dense highly overdetermined systems that achieves residuals similar to those of direct QR factorization-based solvers, outperforms lapack by large factors, and scales significantly better than any QR-based solver.
Abstract: Several innovative random-sampling and random-mixing techniques for solving problems in linear algebra have been proposed in the last decade, but they have not yet made a significant impact on numerical linear algebra. We show that by using a high-quality implementation of one of these techniques, we obtain a solver that performs extremely well in the traditional yardsticks of numerical linear algebra: it is significantly faster than high-performance implementations of existing state-of-the-art algorithms, and it is numerically backward stable. More specifically, we describe a least-squares solver for dense highly overdetermined systems that achieves residuals similar to those of direct QR factorization-based solvers (lapack), outperforms lapack by large factors, and scales significantly better than any QR-based solver.

182 citations


Journal ArticleDOI
Taiping Zhang1, Bin Fang1, Yuan Yan Tang1, Zhaowei Shang1, Bin Xu1 
01 Feb 2010
TL;DR: Comparisons of experimental results on different data sets are given with respect to existing LDA extensions, including PCA + LDA, LDA via generalized singular value decomposition, regularized LDA, NLDA, and LDA via QR decomposition, which demonstrate the effectiveness of the proposed EDA method.
Abstract: Linear discriminant analysis (LDA) is well known as a powerful tool for discriminant analysis. In the case of a small training data set, however, it cannot directly be applied to high-dimensional data. This case is the so-called small-sample-size or undersampled problem. In this paper, we propose an exponential discriminant analysis (EDA) technique to overcome the undersampled problem. The advantages of EDA are that, compared with principal component analysis (PCA) + LDA, the EDA method can extract the most discriminant information that was contained in the null space of a within-class scatter matrix, and compared with another LDA extension, i.e., null-space LDA (NLDA), the discriminant information that was contained in the non-null space of the within-class scatter matrix is not discarded. Furthermore, EDA is equivalent to transforming original data into a new space by distance diffusion mapping, and then, LDA is applied in such a new space. As a result of diffusion mapping, the margin between different classes is enlarged, which is helpful in improving classification accuracy. Comparisons of experimental results on different data sets are given with respect to existing LDA extensions, including PCA + LDA, LDA via generalized singular value decomposition, regularized LDA, NLDA, and LDA via QR decomposition, which demonstrate the effectiveness of the proposed EDA method.

160 citations


Proceedings ArticleDOI
TL;DR: This work is on CULA, a GPU accelerated implementation of linear algebra routines, and presents results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares.
Abstract: The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and an excellent FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring $O(N^3)$ operations and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition and QR decomposition along with applications like system solution and least squares. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via a hybrid processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.

153 citations


Journal ArticleDOI
TL;DR: An iterative algorithm is proposed for joint multi-path Rayleigh channel complex gains estimation and data recovery in fast fading environments and is supported by theoretical analysis and simulation results, which are obtained considering Jakes' channels with high Doppler spreads.
Abstract: This paper deals with the case of a high speed mobile receiver operating in an orthogonal-frequency-division-multiplexing (OFDM) communication system. Assuming the knowledge of delay-related information, we propose an iterative algorithm for joint multi-path Rayleigh channel complex gains estimation and data recovery in fast fading environments. Each complex gain time-variation, within one OFDM symbol, is approximated by a polynomial representation. Based on the Jakes process, an auto-regressive (AR) model of the polynomial coefficients dynamics is built, making it possible to employ the Kalman filter estimator for the polynomial coefficients. Hence, the channel matrix is easily computed, and the data symbol is estimated free of inter-sub-carrier interference (ICI) thanks to the use of a QR-decomposition of the channel matrix. Our claims are supported by theoretical analysis and simulation results, which are obtained considering Jakes' channels with high Doppler spreads.
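
The abstract does not spell out how the QR decomposition removes the interference, so the following is only a generic sketch of QR-based successive detection, not the authors' algorithm. It assumes a real-valued channel matrix `h` with full column rank, a noiseless or low-noise observation `y = h x`, and a hypothetical two-point symbol alphabet. The names `qr_gram_schmidt` and `detect` are illustrative.

```python
import math

def qr_gram_schmidt(h):
    """Return (q_cols, r) with h = Q R; h is a list of rows with full column rank."""
    m, n = len(h), len(h[0])
    cols = [[h[i][j] for i in range(m)] for j in range(n)]
    q_cols, r = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        qj = [x / r[j][j] for x in cols[j]]
        q_cols.append(qj)
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(qj, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(qj, cols[k])]
    return q_cols, r

def detect(h, y, symbols=(-1.0, 1.0)):
    """Estimate x from y = h x by rotating with Q^T and back-substituting through R."""
    q_cols, r = qr_gram_schmidt(h)
    n = len(r)
    z = [sum(a * b for a, b in zip(q, y)) for q in q_cols]  # z = Q^T y = R x
    x_hat = [0.0] * n
    for i in range(n - 1, -1, -1):
        # Subtract the contribution of symbols that were already decided.
        interference = sum(r[i][k] * x_hat[k] for k in range(i + 1, n))
        estimate = (z[i] - interference) / r[i][i]
        # Assumed slicer: pick the nearest symbol of the alphabet.
        x_hat[i] = min(symbols, key=lambda s: abs(s - estimate))
    return x_hat
```

Because R is upper triangular, each row of `R x = Q^T y` involves only symbols that have already been decided, which is what lets the interference be cancelled one symbol at a time.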

118 citations


Journal ArticleDOI
TL;DR: Two greedy algorithms that compute discrete versions of Fekete-like points for multivariate compact sets by basic tools of numerical linear algebra are discussed and compared.
Abstract: We discuss and compare two greedy algorithms that compute discrete versions of Fekete-like points for multivariate compact sets by basic tools of numerical linear algebra. The first gives the so-called approximate Fekete points by QR factorization with column pivoting of Vandermonde-like matrices. The second computes discrete Leja points by LU factorization with row pivoting. Moreover, we study the asymptotic distribution of such points when they are extracted from weakly admissible meshes.
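
As a rough illustration of the first algorithm, the sketch below applies greedy column pivoting to the transpose of a small Vandermonde matrix (rows are basis functions, columns are candidate points) and returns the indices of the selected points. It is a minimal sketch under stated assumptions: a one-dimensional monomial basis, a hand-written pivoting loop rather than a library QR routine, and the hypothetical name `pivoted_qr_select`.

```python
import math

def pivoted_qr_select(a, k):
    """Return the first k column indices chosen by QR-style column pivoting.

    a is a list of rows; each column corresponds to one candidate point.
    """
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    chosen = []
    for _ in range(k):
        # Pivot rule: take the not-yet-chosen column with the largest residual norm.
        norms = [-1.0 if j in chosen else sum(x * x for x in cols[j])
                 for j in range(n)]
        p = max(range(n), key=lambda j: norms[j])
        chosen.append(p)
        norm = math.sqrt(norms[p])
        q = [x / norm for x in cols[p]]
        # Remove the chosen direction from every remaining column.
        for j in range(n):
            if j not in chosen:
                dot = sum(qi * cj for qi, cj in zip(q, cols[j]))
                cols[j] = [cj - dot * qi for qi, cj in zip(q, cols[j])]
    return chosen

# Candidate mesh on [-1, 1] and the monomial basis 1, x, x^2 (both are assumptions).
points = [-1.0 + 2.0 * i / 20 for i in range(21)]
vandermonde_t = [[1.0] * len(points), points, [x * x for x in points]]
selected = pivoted_qr_select(vandermonde_t, 3)
```

For this small mesh the selected indices correspond to the two endpoints and the midpoint of the interval.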

111 citations


Book ChapterDOI
05 Sep 2010
TL;DR: This work improves on the latest published approaches to bundle adjustment with conjugate gradients by making full use of the least squares nature of the problem and shows how a certain property of the preconditioned system allows us to reduce the work per iteration to roughly half of the standard CG algorithm.
Abstract: Bundle adjustment for multi-view reconstruction is traditionally done using the Levenberg-Marquardt algorithm with a direct linear solver, which is computationally very expensive. An alternative to this approach is to apply the conjugate gradients algorithm in the inner loop. This is appealing since the main computational step of the CG algorithm involves only a simple matrix-vector multiplication with the Jacobian. In this work we improve on the latest published approaches to bundle adjustment with conjugate gradients by making full use of the least squares nature of the problem. We employ an easy-to-compute QR factorization based block preconditioner and show how a certain property of the preconditioned system allows us to reduce the work per iteration to roughly half of the standard CG algorithm.

94 citations


Journal ArticleDOI
TL;DR: The newly proposed method (termed as CX_D) selects columns in a deterministic manner, which well approximates singular value decomposition.
Abstract: In this paper, we propose a deterministic column-based matrix decomposition method. Conventional column-based matrix decomposition (CX) computes the columns by randomly sampling columns of the data matrix. Instead, the newly proposed method (termed as CX_D) selects columns in a deterministic manner, which well approximates singular value decomposition. The experimental results well demonstrate the power and the advantages of the proposed method upon three real-world data sets.

76 citations


Journal ArticleDOI
TL;DR: The proposed IQRD hardware is constructed from the diagonal and the triangular processes with fewer gate counts and lower power consumption than TSAQRD, and the total clock latency is only 10m - 5 cycles.
Abstract: Implementation of an iterative QR decomposition (QRD) (IQRD) architecture based on the modified Gram-Schmidt (MGS) algorithm is proposed in this paper. QRD is extensively adopted in the detection of multiple-input-multiple-output systems. In order to achieve computational efficiency with robust numerical stability, a triangular systolic array (TSA) for QRD of large-size matrices is presented. In addition, the TSA architecture can be modified into an iterative architecture that is called IQRD for reducing hardware cost. The IQRD hardware is constructed from the diagonal and the triangular processes with fewer gate counts and lower power consumption than TSAQRD. For a 4 × 4 matrix, the proposed IQRD reduces the gate count by about 41% relative to TSAQRD. For a generic square matrix of order m, the latency required by IQRD is 2m - 1 time units, which is based on the MGS algorithm. Thus, the total clock latency is only 10m - 5 cycles.
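
For readers unfamiliar with the modified Gram-Schmidt (MGS) algorithm that the architecture is based on, here is a minimal software sketch. It says nothing about the systolic hardware itself. It assumes a real matrix with full column rank, given as a list of rows; `mgs_qr` is an illustrative name.

```python
import math

def mgs_qr(a):
    """Return (q, r) with a = q r, computed by modified Gram-Schmidt."""
    m, n = len(a), len(a[0])
    # v[j] holds the current (partially orthogonalized) j-th column.
    v = [[a[i][j] for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal step: normalize the current column.
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        qj = [x / r[j][j] for x in v[j]]
        q_cols.append(qj)
        # Triangular step: remove the q_j component from all later columns at once.
        for k in range(j + 1, n):
            r[j][k] = sum(qj[i] * v[k][i] for i in range(m))
            v[k] = [v[k][i] - r[j][k] * qj[i] for i in range(m)]
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r
```

The difference from classical Gram-Schmidt is that each projection is taken against the already-updated columns, which is the usual reason MGS is preferred numerically.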

70 citations


Journal ArticleDOI
TL;DR: The Halley iteration can be implemented via QR decompositions without explicit matrix inversions, and it is an inverse free communication friendly algorithm for the emerging multicore and hybrid high performance computing systems.
Abstract: We introduce a dynamically weighted Halley (DWH) iteration for computing the polar decomposition of a matrix, and we prove that the new method is globally and asymptotically cubically convergent. For matrices with condition number no greater than $10^{16}$, the DWH method needs at most six iterations for convergence with the tolerance $10^{-16}$. The Halley iteration can be implemented via QR decompositions without explicit matrix inversions. Therefore, it is an inverse free communication friendly algorithm for the emerging multicore and hybrid high performance computing systems.
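
To make the "QR without explicit inversion" remark concrete, the sketch below runs a Halley-type iteration $X_{k+1} = X_k (aI + bX_k^T X_k)(I + cX_k^T X_k)^{-1}$ using only a QR factorization of the stacked matrix $[\sqrt{c}\,X_k;\ I] = [Q_1;\ Q_2]R$, via the algebraically equivalent update $X_{k+1} = (b/c)X_k + \frac{1}{\sqrt{c}}(a - b/c)\,Q_1 Q_2^T$. This is a sketch under loud assumptions: it uses the fixed classical Halley weights a = 3, b = 1, c = 3 rather than the paper's dynamic weights, a real square nonsingular input, a fixed iteration count, and a hand-written Gram-Schmidt QR. All function names are illustrative.

```python
import math

def mgs_q(a):
    """Orthonormal factor Q (list of rows) of a full-column-rank matrix."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    for j in range(n):
        norm = math.sqrt(sum(x * x for x in cols[j]))
        cols[j] = [x / norm for x in cols[j]]
        for k in range(j + 1, n):
            dot = sum(p * c for p, c in zip(cols[j], cols[k]))
            cols[k] = [c - dot * p for p, c in zip(cols[j], cols[k])]
    return [[cols[j][i] for j in range(n)] for i in range(m)]

def halley_polar_factor(a, iterations=20, wa=3.0, wb=1.0, wc=3.0):
    """Approximate the orthogonal polar factor of a square nonsingular matrix.

    wa, wb, wc are the fixed (unweighted) Halley coefficients, an assumption
    made here for simplicity; the dynamic weighting is not reproduced.
    """
    n = len(a)
    x = [row[:] for row in a]
    root_c = math.sqrt(wc)
    for _ in range(iterations):
        # Stack sqrt(c) * X on top of the identity; this always has full column rank.
        stacked = [[root_c * v for v in row] for row in x]
        stacked += [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        q = mgs_q(stacked)
        q1, q2 = q[:n], q[n:]
        q1q2t = [[sum(q1[i][k] * q2[j][k] for k in range(n)) for j in range(n)]
                 for i in range(n)]
        w = (wa - wb / wc) / root_c
        x = [[(wb / wc) * x[i][j] + w * q1q2t[i][j] for j in range(n)]
             for i in range(n)]
    return x
```

The stacked matrix contains an identity block, so its QR factorization is well defined even when `x` itself is poorly conditioned; no matrix inverse is formed at any step.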

63 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose an algorithm for finding the finest simultaneous block-diagonalization of a finite number of square matrices, or equivalently the irreducible decomposition of a matrix *-algebra given in terms of its generators.
Abstract: An algorithm is proposed for finding the finest simultaneous block-diagonalization of a finite number of square matrices, or equivalently the irreducible decomposition of a matrix *-algebra given in terms of its generators. This extends the approach initiated by Murota–Kanno–Kojima–Kojima. The algorithm, composed of numerical-linear algebraic computations, does not require any algebraic structure to be known in advance. The main ingredient of the algorithm is the Schur decomposition and its skew-Hamiltonian variant for eigenvalue computation.

Journal ArticleDOI
TL;DR: This algorithm amounts to transforming a polynomial matrix to upper triangular form by application of a series of paraunitary matrices such as elementary delay and rotation matrices, and can be used to formulate the singular value decomposition (SVD) of a polynomial matrix.
Abstract: In this paper, a new algorithm for calculating the QR decomposition (QRD) of a polynomial matrix is introduced. This algorithm amounts to transforming a polynomial matrix to upper triangular form by application of a series of paraunitary matrices such as elementary delay and rotation matrices. It is shown that this algorithm can also be used to formulate the singular value decomposition (SVD) of a polynomial matrix, which essentially amounts to diagonalizing a polynomial matrix again by application of a series of paraunitary matrices. Example matrices are used to demonstrate both types of decomposition. Mathematical proofs of convergence of both decompositions are also outlined. Finally, a possible application of such decompositions in multichannel signal processing is discussed.

Journal ArticleDOI
TL;DR: This paper presents a new implementation of null space based linear discriminant analysis and shows that the optimal transformation matrix is obtained easily using only orthogonal transformations, without computing any eigendecomposition or singular value decomposition (SVD); consequently, its main computational cost comes from an economic QR factorization of the data matrix.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: A recently proposed algorithm (Communication Avoiding QR) is articulated with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites.
Abstract: Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization – one of the main dense linear algebra kernels – of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).

Proceedings ArticleDOI
09 Jan 2010
TL;DR: This work presents a novel parallel cache assignment approach which scales well with p and applies this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrates superlinear panel factorization speedups on both machines.
Abstract: In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors (p). Amdahl's law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach which we show scales well with p. We apply this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrate superlinear panel factorization speedups on both machines. Other approaches to this problem demand complicated reformulations of the computational approach, new kernels to be tuned, new mathematics, an inflation of the high-order flop count, and do not perform as well. By demonstrating a straightforward alternative that avoids all of these contortions and scales with p, we address a critical stumbling block for dense linear algebra in the age of massive parallelism.

Journal ArticleDOI
TL;DR: A design algorithm is constructed for discrete-time feedback control that allows a target subspace to be stabilized, proving that if the control problem is feasible, then the algorithm returns an effective control choice.
Abstract: We analyze the asymptotic behavior of discrete-time, Markovian quantum systems with respect to a subspace of interest. Global asymptotic stability of subspaces is relevant to quantum information processing, in particular for initializing the system in pure states or subspace codes. We provide a linear-algebraic characterization of the dynamical properties leading to invariance and attractivity of a given quantum subspace. We then construct a design algorithm for discrete-time feedback control that allows a target subspace to be stabilized, proving that if the control problem is feasible, then the algorithm returns an effective control choice. In order to prove this result, a canonical QR matrix decomposition is derived, and also used to establish the control scheme potential for the simulation of open-system dynamics.

Journal ArticleDOI
TL;DR: A stable method for accurate IP fitting that automatically determines the moderate degree required by increasing the degree of the IP until a satisfactory fitting result is obtained, and that can selectively apply ridge regression-based constraints only to the element detected as unstable.
Abstract: Representing 2D and 3D data sets with implicit polynomials (IPs) has been attractive because of its applicability to various computer vision issues. Therefore, many IP fitting methods have already been proposed. However, the existing fitting methods can be and need to be improved with respect to computational cost for deciding on the appropriate degree of the IP representation and to fitting accuracy, while still maintaining the stability of the fit. We propose a stable method for accurate fitting that automatically determines the moderate degree required. Our method increases the degree of IP until a satisfactory fitting result is obtained. The incrementability of QR decomposition with Gram-Schmidt orthogonalization gives our method computational efficiency. Furthermore, since the decomposition detects the instability element precisely, our method can selectively apply ridge regression-based constraints to that element only. As a result, our method achieves computational stability while maintaining fitting accuracy. Experimental results demonstrate the effectiveness of our method compared with prior methods.
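
The "incrementability" mentioned in the abstract can be illustrated with a small sketch: when a new column (for example, a higher-degree monomial evaluated on the data) is appended, the existing Q and R are reused and only one new column of each is computed. This is a generic Gram-Schmidt update, not the paper's fitting method. The name `append_column`, the column-wise storage of R, and the tolerance `tol` are assumptions, and the ridge-regression handling of unstable elements is not reproduced; the sketch simply raises an error when the new column is numerically dependent.

```python
import math

def append_column(q_cols, r_cols, col, tol=1e-12):
    """Extend a Gram-Schmidt QR factorization by one column, in place.

    q_cols: list of orthonormal columns found so far.
    r_cols: r_cols[k] holds the k + 1 nonzero entries of column k of R.
    Returns the new diagonal entry of R (the residual norm of col).
    """
    v = list(col)
    coeffs = []
    for q in q_cols:
        # Project out each existing direction from the running residual.
        c = sum(qi * vi for qi, vi in zip(q, v))
        coeffs.append(c)
        v = [vi - c * qi for qi, vi in zip(q, v)]
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= tol:
        # A tiny residual means the new column adds (almost) no new direction.
        raise ValueError("new column is numerically dependent on earlier ones")
    q_cols.append([x / norm for x in v])
    r_cols.append(coeffs + [norm])
    return norm
```

The cost of adding column k is proportional to k times the column length, so raising the degree step by step does not require refactoring the columns that are already processed.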

Journal ArticleDOI
TL;DR: It is shown that the proposed SVD-QRcp based feature selection outperforms the F-Ratio based method and that the proposed feature extraction tool is superior to baseline MFCC and LFCC.

Proceedings ArticleDOI
13 Nov 2010
TL;DR: The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices, which is able to outperform the de facto ScaLAPACK library by up to 4 times, and has good scalability on up to 3,072 cores.
Abstract: As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.
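
The core idea behind communication-avoiding QR for tall and skinny matrices can be sketched in a few lines: factor each row block independently, stack the small R factors, and factor the stack once more. The sketch below is a single-level, sequential illustration only. It does not model tiles, the scheduler, or the reduction tree, and it assumes every block has at least as many rows as columns and full column rank. `qr_r_factor` and `tsqr_r` are illustrative names.

```python
import math

def qr_r_factor(a):
    """R factor (positive diagonal) of a full-column-rank matrix given as rows."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        qj = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(p * c for p, c in zip(qj, cols[k]))
            cols[k] = [c - r[j][k] * p for p, c in zip(qj, cols[k])]
    return r

def tsqr_r(a, block_rows):
    """R factor of a tall matrix, computed from independent row blocks."""
    blocks = [a[i:i + block_rows] for i in range(0, len(a), block_rows)]
    # Each local factorization touches only its own block, so in a
    # distributed setting these could run on separate nodes.
    local_rs = [qr_r_factor(block) for block in blocks]
    # Only the small n-by-n factors need to be communicated and combined.
    stacked = [row for r in local_rs for row in r]
    return qr_r_factor(stacked)
```

With a positive diagonal the R factor of a full-rank matrix is unique, so `tsqr_r(a, block_rows)` should agree with `qr_r_factor(a)` up to rounding.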

Proceedings ArticleDOI
19 Apr 2010
TL;DR: This work presents a new fully asynchronous method for computing a QR factorization on shared-memory multicore architectures that overcomes this bottleneck and aims to eventually incorporate this work into the Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) library.
Abstract: To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist in scheduling a Directed Acyclic Graph (DAG) of tasks of fine granularity where nodes represent tasks, either panel factorization or update of a block-column, and edges represent dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices or small square matrices. We present a new fully asynchronous method for computing a QR factorization on shared-memory multicore architectures that overcomes this bottleneck. Our contribution is to adapt an existing algorithm that performs a panel factorization in parallel (named Communication-Avoiding QR and initially designed for distributed-memory machines), to the context of tile algorithms using asynchronous computations. An experimental study shows significant improvement (up to almost 10 times faster) compared to state-of-the-art approaches. We aim to eventually incorporate this work into the Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) library.

Proceedings ArticleDOI
15 Oct 2010
TL;DR: It is shown that while the proposed scheme has low computational complexity, it has better robustness against some image processing attacks in comparison with SVD and DCT methods.
Abstract: A novel blind watermarking technique based on QR decomposition in still images is proposed. The method is implemented in wavelet domain and its robustness has been evaluated against some image processing attacks and the results have been compared with two traditional methods i.e., SVD and DCT. It is shown that while the proposed scheme has low computational complexity, it has better robustness against some image processing attacks in comparison with SVD and DCT methods.

Journal ArticleDOI
TL;DR: An implicit version of the shifted QR eigenvalue algorithm given in Bini et al. is presented for computing the eigenvalues of an $n \times n$ companion matrix using $O(n^2)$ flops and $O(n)$ memory storage.

Book ChapterDOI
27 Sep 2010
TL;DR: In this article, a Jacobi-like procedure based on polar matrix decomposition is proposed for the joint eigenvalue decomposition of a set of real non-defective matrices.
Abstract: In this paper we propose a new algorithm for the joint eigenvalue decomposition of a set of real non-defective matrices. Our approach resorts to a Jacobi-like procedure based on polar matrix decomposition. We introduce a new criterion in this context for the optimization of the hyperbolic matrices, giving birth to an original algorithm called JDTM. This algorithm is described in detail and a comparison study with reference algorithms is performed. Comparison results show that our approach provides quicker and more accurate results in all the considered situations.

Journal ArticleDOI
TL;DR: This work shows when and how techniques based on the singular value decomposition (SVD) and the QR decomposition of a fundamental matrix solution can be used to infer if a system enjoys—or not—exponential dichotomy on the whole real line.

01 Jan 2010
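TL;DR: For least squares problems in which the rows of the coefficient matrix vary widely in norm, Householder QR factorization without pivoting has unsatisfactory backward stability properties; sorting the rows by decreasing ∞-norm obviates the need for row pivoting, and row-wise backward stability is obtained for only one of the two possible choices of sign in the Householder vector.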
Abstract: For least squares problems in which the rows of the coefficient matrix vary widely in norm, Householder QR factorization (without pivoting) has unsatisfactory backward stability properties. Powell and Reid showed in 1969 that the use of both row and column pivoting leads to a desirable row-wise backward error result. We give a reworked backward error analysis in modern notation and prove two new results. First, sorting the rows by decreasing ∞-norm at the start of the factorization obviates the need for row pivoting. Second, row-wise backward stability is obtained for only one of the two possible choices of sign in the Householder vector.
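
The row-sorting step and the role of the sign in the Householder vector can be sketched as follows. This is an illustration under explicit assumptions, not the analysed algorithm: column pivoting is omitted for brevity, the matrix is assumed to have full column rank, and the sign used is the conventional cancellation-avoiding one. The abstract does not say which of the two sign choices yields row-wise backward stability, so no claim is made on that point. The function names are illustrative.

```python
import math

def sort_rows_by_inf_norm(a, b):
    """Reorder the rows of (a, b) by decreasing infinity-norm of the rows of a."""
    order = sorted(range(len(a)), key=lambda i: max(abs(x) for x in a[i]),
                   reverse=True)
    return [list(a[i]) for i in order], [b[i] for i in order]

def householder_least_squares(a, b):
    """Solve min ||a x - b||_2 by Householder QR after sorting the rows."""
    a, b = sort_rows_by_inf_norm(a, b)
    m, n = len(a), len(a[0])
    for k in range(n):
        x = [a[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            raise ValueError("matrix is rank deficient")
        # Assumed sign convention: alpha has the opposite sign of x[0],
        # so v[0] = x[0] - alpha involves no cancellation.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # Apply H = I - 2 v v^T / (v^T v) to the trailing columns and to b.
        for j in range(k, n):
            s = 2.0 * sum(v[i] * a[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                a[k + i][j] -= s * v[i]
        s = 2.0 * sum(v[i] * b[k + i] for i in range(m - k)) / vtv
        for i in range(m - k):
            b[k + i] -= s * v[i]
    # Back substitution on the leading n-by-n triangle.
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        sol[i] = (b[i] - sum(a[i][j] * sol[j] for j in range(i + 1, n))) / a[i][i]
    return sol
```

Flipping the sign of `alpha` gives the other possible Householder vector; comparing the two on badly row-scaled problems is one way to explore the behaviour the abstract describes.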

Journal ArticleDOI
TL;DR: In this paper, the problem of polynomial least squares fitting in which the usual monomial basis is replaced by the Bernstein basis is considered, and an algorithm for obtaining the QR decomposition of A is applied.

Journal ArticleDOI
TL;DR: Simulation results confirm the bit-error-rate (BER) and throughput performance superiority of the proposed systems compared to conventional SVD per-carrier precoding schemes.
Abstract: QR decomposition (QRD)-based precoded MIMO-OFDM systems with reduced feedback are proposed to convert the MIMO-OFDM channel into layered subchannels. QRD-M is further combined with either singular value (SVD) or geometric mean decomposition (GMD) of the time-domain channel impulse response matrix. As a result, the receiver in the proposed systems only needs to feed back information describing one precoding matrix for all carriers. Simulation results confirm the bit-error-rate (BER) and throughput performance superiority of the proposed systems compared to conventional SVD per-carrier precoding schemes.

Journal ArticleDOI
TL;DR: New rigorous perturbation bounds for the Cholesky, LU, and QR factorizations with normwise or componentwise perturbations in the given matrix can be much tighter than the existing rigorous bounds obtained by the classic matrix equation approach.
Abstract: This article presents rigorous normwise perturbation bounds for the Cholesky, LU, and QR factorizations with normwise or componentwise perturbations in the given matrix. The considered componentwise perturbations have the form of backward rounding errors for the standard factorization algorithms. The used approach is a combination of the classic and refined matrix equation approaches. Each of the new rigorous perturbation bounds is a small constant multiple of the corresponding first-order perturbation bound obtained by the refined matrix equation approach in the literature and can be estimated efficiently. These new bounds can be much tighter than the existing rigorous bounds obtained by the classic matrix equation approach, while the conditions for the former to hold are almost as moderate as the conditions for the latter to hold.

Journal ArticleDOI
TL;DR: Both a normwise and a componentwise error analysis for the QR factorization of long products of invertible matrices are developed, and numerical results illustrate the dependence on the degree of nonnormality and the strength of integral separation.
Abstract: We develop both a normwise and a componentwise error analysis for the QR factorization of long products of invertible matrices. We obtain global error bounds for both the orthogonal and upper triangular factors that depend on uniform bounds on the size of the local error, the local degree of nonnormality, and integral separation, a natural condition related to gaps between eigenvalues but for products of matrices. We illustrate our analytical results with numerical results that show the dependence on the degree of nonnormality and the strength of integral separation.

Proceedings ArticleDOI
08 Mar 2010
TL;DR: The proposed architecture is the first VLSI implementation of a max-log ML MIMO detector that includes QR decomposition and SO generation, with the latter having a deterministic, very high throughput thanks to a fully parallelizable structure, and parameterizability in terms of both the number of transmit and receive antennas and the supported modulation orders.
Abstract: In this paper a VLSI architecture of a high throughput and high performance soft-output (SO) MIMO detector (the recently presented Layered ORthogonal Lattice Detector, LORD) is presented. The baseline implementation includes optimal (i.e. maximum-likelihood -- ML -- in the max-log sense) SO generation. A reduced complexity variant of the SO generation stage is also described. To the best of the authors' knowledge, the proposed architecture is the first VLSI implementation of a max-log ML MIMO detector that includes QR decomposition and SO generation, with the latter having a deterministic, very high throughput thanks to a fully parallelizable structure, and parameterizability in terms of both the number of transmit and receive antennas and the supported modulation orders. The two designs achieve a very high throughput, making them particularly suitable for MIMO-OFDM systems such as IEEE 802.11n WLANs: the most demanding requirements are satisfied at a reasonable cost of area and power consumption.