
Showing papers on "Matrix multiplication published in 2002"


Journal ArticleDOI
Uri Zwick1
TL;DR: Two new algorithms for solving the All Pairs Shortest Paths (APSP) problem for weighted directed graphs using fast matrix multiplication algorithms are presented.
Abstract: We present two new algorithms for solving the All Pairs Shortest Paths (APSP) problem for weighted directed graphs. Both algorithms use fast matrix multiplication algorithms. The first algorithm solves the APSP problem for weighted directed graphs in which the edge weights are integers of small absolute value in Õ(n^(2+μ)) time, where μ satisfies the equation ω(1, μ, 1) = 1 + 2μ and ω(1, μ, 1) is the exponent of the multiplication of an n × n^μ matrix by an n^μ × n matrix. Currently, the best available bounds on ω(1, μ, 1), obtained by Coppersmith, imply that μ < 0.575, so the running time of the algorithm is O(n^2.575). The second algorithm solves the APSP problem almost exactly for directed graphs with arbitrary non-negative real edge weights. It runs in Õ((n^ω/ε) log(W/ε)) time, where ω < 2.376 is the exponent of square matrix multiplication, ε > 0 is an error parameter and W is the largest edge weight in the graph, after the edge weights are scaled so that the smallest non-zero edge weight in the graph is 1. It returns estimates of all the distances in the graph with a stretch of at most 1 + ε. Corresponding paths can also be found efficiently.
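
The algorithmic core both results build on is the distance ((min, +)) product: squaring the weight matrix under (min, +) about ⌈log₂ n⌉ times yields all-pairs distances, and the paper's contribution is computing such products quickly via fast matrix multiplication. A minimal sketch of that baseline idea (numpy assumed; illustrative, not the paper's algorithm):

```python
import numpy as np

def min_plus_product(A, B):
    """(min,+) product: C[i,j] = min_k A[i,k] + B[k,j]."""
    # Broadcasting forms all sums A[i,k] + B[k,j], then reduces over k.
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

def apsp(W):
    """All-pairs shortest paths by repeated squaring of the weight matrix.
    W[i,j] = edge weight, np.inf if no edge, 0 on the diagonal."""
    n = W.shape[0]
    D = W.copy()
    steps = 1
    while steps < n - 1:          # D covers paths of <= steps edges
        D = min_plus_product(D, D)
        steps *= 2
    return D

W = np.array([[0, 3, np.inf],
              [np.inf, 0, 1],
              [2, np.inf, 0]], dtype=float)
print(apsp(W))  # e.g. distance 0 -> 2 is 3 + 1 = 4
```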

286 citations


Journal ArticleDOI
TL;DR: In this paper, the determinant of the target matrix is log-normally distributed, whereas the remainder is a surprisingly complicated function of a parameter characterizing the norm of the matrix and its skewness.
Abstract: We derive analytic expressions for infinite products of random 2 × 2 matrices. The determinant of the target matrix is log-normally distributed, whereas the remainder is a surprisingly complicated function of a parameter characterizing the norm of the matrix and a parameter characterizing its skewness. The distribution may have importance as an uncommitted prior in statistical image analysis.
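
Why the determinant comes out log-normal is standard reasoning (not quoted from the paper): determinants are multiplicative, so for i.i.d. random factors A_1, …, A_n,

```latex
\log\Bigl|\det\prod_{i=1}^{n} A_i\Bigr| \;=\; \sum_{i=1}^{n} \log\bigl|\det A_i\bigr|,
```

a sum of i.i.d. random variables that the central limit theorem drives to a Gaussian; exponentiating gives the log-normal law.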

277 citations


Journal ArticleDOI
TL;DR: In this article, the authors give explicit inverse formulae for 2 × 2 block matrices with three different partitions and apply these results to obtain inverses of block triangular matrices and various structured matrices such as Hamiltonian, per-Hermitian, and centro-Hermitian matrices.
Abstract: In this paper, the authors give explicit inverse formulae for 2 × 2 block matrices with three different partitions. Then these results are applied to obtain inverses of block triangular matrices and various structured matrices such as Hamiltonian, per-Hermitian, and centro-Hermitian matrices.
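
For reference, the prototypical formula of this kind is the Schur-complement inverse, valid when A and S = D − CA⁻¹B are invertible (the paper's contribution is the analogous formulae for the other partitions):

```latex
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
=
\begin{pmatrix}
A^{-1} + A^{-1}BS^{-1}CA^{-1} & -A^{-1}BS^{-1} \\
-S^{-1}CA^{-1} & S^{-1}
\end{pmatrix},
\qquad S = D - CA^{-1}B.
```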

261 citations


Proceedings ArticleDOI
01 Jul 2002
TL;DR: A natural and geometrically meaningful definition of scalar multiples and a commutative addition of transformations based on the matrix representation are derived, given that the matrices have no negative real eigenvalues.
Abstract: Geometric transformations are most commonly represented as square matrices in computer graphics. Following simple geometric arguments we derive a natural and geometrically meaningful definition of scalar multiples and a commutative addition of transformations based on the matrix representation, given that the matrices have no negative real eigenvalues. Together, these operations allow the linear combination of transformations. This provides the ability to create weighted combinations of transformations, to interpolate between transformations, and to construct or use arbitrary transformations in a structure similar to a basis of a vector space. These basic techniques are useful for synthesis and analysis of motions or animations. Animations through a set of key transformations are generated using standard techniques such as subdivision curves. For analysis and progressive compression a PCA can be applied to sequences of transformations. We describe an implementation of the techniques that enables an easy-to-use and transparent way of dealing with geometric transformations in graphics software. We compare and relate our approach to other techniques such as matrix decomposition and quaternion interpolation.
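
Such operations can be realized through matrix logarithms and exponentials, e.g. combining transformations as exp(Σᵢ wᵢ log Tᵢ); a minimal sketch assuming SciPy (function and variable names are illustrative, and the restriction to no negative real eigenvalues is what keeps the principal logarithm well defined):

```python
import numpy as np
from scipy.linalg import expm, logm

def combine(transforms, weights):
    """Weighted combination of transformations: exp(sum_i w_i * log T_i).
    Requires matrices with no negative real eigenvalues so that the
    principal matrix logarithm is well defined."""
    L = sum(w * logm(T) for w, T in zip(weights, transforms))
    return expm(L).real

# Interpolate halfway between the identity and a 90-degree rotation.
R90 = np.array([[0.0, -1.0],
                [1.0,  0.0]])
half = combine([np.eye(2), R90], [0.5, 0.5])
print(half)  # approximately a 45-degree rotation matrix
```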

253 citations


Journal ArticleDOI
TL;DR: The basic ideas of ℋ- and ℋ²-matrices are introduced and an algorithm that adaptively computes approximations of general matrices in the latter format is presented.
Abstract: A class of matrices (ℋ²-matrices) has recently been introduced for storing discretisations of elliptic problems and integral operators from the BEM. These matrices have the following properties: (i) They are sparse in the sense that only few data are needed for their representation. (ii) The matrix-vector multiplication is of linear complexity. (iii) In general, sums and products of these matrices are no longer in the same set, but after truncation to the ℋ²-matrix format these operations are again of quasi-linear complexity. We introduce the basic ideas of ℋ- and ℋ²-matrices and present an algorithm that adaptively computes approximations of general matrices in the latter format.
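
A toy illustration of property (i) (numpy; illustrative only, not the paper's adaptive algorithm): for interactions between two well-separated point clusters, the corresponding matrix block is numerically low-rank, which is exactly what the data-sparse format stores as outer products:

```python
import numpy as np

# 1D model problem: kernel block K[i, j] = log|x_i - y_j| for two
# well-separated clusters -- a typical admissible block.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(3.0, 4.0, 200)
K = np.log(np.abs(x[:, None] - y[None, :]))

# Truncated SVD: keep only singular values above ~1e-10 relative.
U, s, Vt = np.linalg.svd(K)
r = int(np.sum(s > 1e-10 * s[0]))
A, B = U[:, :r] * s[:r], Vt[:r, :]   # K ~= A @ B with small rank r

print(r, np.linalg.norm(K - A @ B) / np.linalg.norm(K))
# Storing A and B costs O((m + n) r) instead of O(m n), and K @ v
# becomes A @ (B @ v): two thin products, hence the fast
# matrix-vector multiplication.
```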

247 citations


Journal ArticleDOI
TL;DR: In this paper, an explicit expression for an associative ∗-product on the fuzzy complex projective space CP^(N−1)_F is derived, which generalises previous results for the fuzzy 2-sphere.

184 citations


Journal ArticleDOI
TL;DR: In this paper, the authors propose a sparse minimum-variance reconstructor for a conventional natural guide star AO system using a sparse approximation for turbulence statistics and recognizing that the nonsparse matrix terms arising from LGS position uncertainty are low-rank adjustments that can be evaluated by using the matrix inversion lemma.
Abstract: The complexity of computing conventional matrix multiply wave-front reconstructors scales as O(n^3) for most adaptive optical (AO) systems, where n is the number of deformable mirror (DM) actuators. This is impractical for proposed systems with extremely large n. It is known that sparse matrix methods improve this scaling for least-squares reconstructors, but sparse techniques are not immediately applicable to the minimum-variance reconstructors now favored for multiconjugate adaptive optical (MCAO) systems with multiple wave-front sensors (WFSs) and DMs. Complications arise from the nonsparse statistics of atmospheric turbulence, and the global tip/tilt WFS measurement errors associated with laser guide star (LGS) position uncertainty. A description is given of how sparse matrix methods can still be applied by use of a sparse approximation for turbulence statistics and by recognizing that the nonsparse matrix terms arising from LGS position uncertainty are low-rank adjustments that can be evaluated by using the matrix inversion lemma. Sample numerical results for AO and MCAO systems illustrate that the approximation made to turbulence statistics has negligible effect on estimation accuracy, the time to compute the sparse minimum-variance reconstructor for a conventional natural guide star AO system scales as O(n^(3/2)) and is only a few seconds for n = 3500, and sparse techniques reduce the reconstructor computations by a factor of 8 for sample MCAO systems with 2417 DM actuators and 4280 WFS subapertures. With extrapolation to 9700 actuators and 17,120 subapertures, a reduction by a factor of approximately 30 or 40 to 1 is predicted.
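
The matrix inversion lemma invoked here is the Sherman–Morrison–Woodbury identity: for a low-rank update UCVᵀ of an easily inverted (here, sparse) matrix A,

```latex
(A + UCV^{T})^{-1}
= A^{-1} - A^{-1}U\,(C^{-1} + V^{T}A^{-1}U)^{-1}\,V^{T}A^{-1},
```

so the nonsparse low-rank terms cost only sparse solves with A plus one small dense solve of rank-sized dimension.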

178 citations


Journal ArticleDOI
TL;DR: It is shown that the computational complexity, measured as the number of matrix multiplications, is essentially independent of system size even for metallic materials with a vanishing band gap.
Abstract: A purification algorithm for expanding the single-particle density matrix in terms of the Hamiltonian operator is proposed. The scheme works with a predefined occupation and requires less than half the number of matrix-matrix multiplications compared to existing methods at low (<10%) and high (>90%) occupancies. The expansion can be used with a fixed chemical potential, in which case it is an asymmetric generalization of, and a substantial improvement over, grand canonical McWeeny purification. It is shown that the computational complexity, measured as the number of matrix multiplications, is essentially independent of system size even for metallic materials with a vanishing band gap.
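
For context, a minimal numpy sketch of the classical McWeeny purification that the proposed scheme generalizes (illustrative only; the paper's algorithm fixes the occupation differently and needs fewer multiplications):

```python
import numpy as np

def mcweeny_purify(P, iterations=30):
    """Classical McWeeny purification: P <- 3P^2 - 2P^3 pushes every
    eigenvalue of P toward 0 or 1, i.e. toward an idempotent density
    matrix. Each sweep costs two matrix-matrix multiplications."""
    for _ in range(iterations):
        P2 = P @ P
        P = 3.0 * P2 - 2.0 * (P2 @ P)
    return P

# Start from a Hermitian matrix with eigenvalues in (0, 1).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
P0 = Q @ np.diag([0.9, 0.8, 0.7, 0.3, 0.2, 0.1]) @ Q.T
P = mcweeny_purify(P0)
print(np.round(np.linalg.eigvalsh(P), 6))  # three 0s and three 1s
```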

143 citations


Journal ArticleDOI
TL;DR: This paper proves quadratic lower bounds for depth-3 arithmetic circuits over fields of characteristic zero for the elementary symmetric functions, the (trace of) iterated matrix multiplication, and the determinant, and gives new shorter formulae of constant depth for the Elementary symmetrical functions.
Abstract: In this paper we prove quadratic lower bounds for depth-3 arithmetic circuits over fields of characteristic zero. Such bounds are obtained for the elementary symmetric functions, the (trace of) iterated matrix multiplication, and the determinant. As corollaries we get the first nontrivial lower bounds for computing polynomials of constant degree, and a gap between the power of depth-3 arithmetic circuits and depth-4 arithmetic circuits. We also give new shorter formulae of constant depth for the elementary symmetric functions. The main technical contribution relates the complexity of computing a polynomial in this model to the wealth of partial derivatives it has on every affine subspace of small co-dimension. Lower bounds for related models utilize an algebraic analog of the Neciporuk lower bound on Boolean formulae.

141 citations


Journal ArticleDOI
TL;DR: Novel recursive blocked algorithms for two-sided matrix equations, which include matrix product terms such as AXB^T, are presented, and the performance improvements are remarkable, including 10-fold speedups or more, compared to standard algorithms.
Abstract: We continue our study of high-performance algorithms for solving triangular matrix equations. They appear naturally in different condition estimation problems for matrix equations and various eigenspace computations, and as reduced systems in standard algorithms. Building on our successful recursive approach applied to one-sided matrix equations (Part I), we now present novel recursive blocked algorithms for two-sided matrix equations, which include matrix product terms such as AXB^T. Examples are the discrete-time standard and generalized Sylvester and Lyapunov equations. The means for achieving high performance is the recursive variable blocking, which has the potential of matching the memory hierarchies of today's high-performance computing systems, and level-3 computations which mainly are performed as GEMM operations. Different implementation issues are discussed, including the design of efficient new algorithms for two-sided matrix products. We present uniprocessor and SMP parallel performance results of recursive blocked algorithms and routines in the state-of-the-art SLICOT library. Although our recursive algorithms with optimized kernels for the two-sided matrix equations perform more operations, the performance improvements are remarkable, including 10-fold speedups or more, compared to standard algorithms.
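
To make the problem class concrete, SciPy ships direct solvers for small instances of these equations; the discrete-time Lyapunov equation below contains exactly the two-sided product term AXAᵀ the paper targets (standard SciPy calls, not the paper's recursive blocked algorithms):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_sylvester

rng = np.random.default_rng(1)
A = 0.3 * rng.standard_normal((4, 4))  # eigenvalues well inside the unit disk
Q = np.eye(4)

# Discrete-time Lyapunov equation A X A^T - X + Q = 0 (two-sided term A X A^T).
X = solve_discrete_lyapunov(A, Q)
print(np.allclose(A @ X @ A.T - X + Q, 0))  # True

# One-sided Sylvester equation A Y + Y B = C (the Part I problem class).
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))
Y = solve_sylvester(A, B, C)
print(np.allclose(A @ Y + Y @ B, C))        # True
```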

98 citations


Proceedings ArticleDOI
19 May 2002
TL;DR: For any c = c(m) ≥ 1, a lower bound of Ω(m² log_{2c} m) is obtained for the size of any arithmetic circuit for the product of two matrices, as long as the circuit doesn't use products with field elements of absolute value larger than c.

Abstract: We prove a lower bound of Ω(m² log m) for the size of any arithmetic circuit for the product of two matrices, over the real or complex numbers, as long as the circuit doesn't use products with field elements of absolute value larger than 1 (where m × m is the size of each matrix). That is, our lower bound is super-linear in the number of inputs and applies to circuits that use addition gates, product gates and products with field elements of absolute value up to 1. More generally, for any c = c(m) ≥ 1, we obtain a lower bound of Ω(m² log_{2c} m) for the size of any arithmetic circuit for the product of two matrices (over the real or complex numbers), as long as the circuit doesn't use products with field elements of absolute value larger than c. We also prove size-depth tradeoffs for such circuits.

Journal ArticleDOI
01 Jan 2002
TL;DR: A method for the data-sparse approximation of matrices resulting from the discretisation of non-local operators occurring in boundary integral methods or as the inverses of partial differential operators is given.
Abstract: We give a short introduction to a method for the data-sparse approximation of matrices resulting from the discretisation of non-local operators occurring in boundary integral methods or as the inverses of partial differential operators. The result of the approximation will be the so-called hierarchical matrices (or ℋ-matrices for short). These matrices form a subset of the set of all matrices and have a data-sparse representation. The essential operations for these matrices (matrix-vector and matrix-matrix multiplication, addition and inversion) can be performed in, up to logarithmic factors, optimal complexity.

Journal ArticleDOI
TL;DR: Five recursive layouts with successively increasing complexity of address computation are evaluated and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.
Abstract: The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication and the more complex algorithms of Strassen (1969) and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance by 10 to 20 percent. Carrying the recursive layout down to the level of individual matrix elements is shown to be counterproductive; a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. Five recursive layouts with successively increasing complexity of address computation are evaluated and it is shown that addressing overheads can be kept in control even for the most computationally demanding of these layouts.
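
A small sketch of the simplest recursive layout involved, Morton (Z) order, which interleaves the bits of the row and column indices so that each quadrant, sub-quadrant, etc. of the matrix is stored contiguously (illustrative Python, not the paper's implementation):

```python
def morton_index(i, j, bits=16):
    """Interleave the bits of (i, j) to get the Z-order position."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)  # row bits to odd positions
        z |= ((j >> b) & 1) << (2 * b)      # column bits to even positions
    return z

# Lay out a 4x4 matrix: each 2x2 quadrant becomes contiguous.
order = sorted((morton_index(i, j), (i, j))
               for i in range(4) for j in range(4))
print([ij for _, ij in order])
# [(0,0), (0,1), (1,0), (1,1), (0,2), (0,3), (1,2), (1,3), ...]
```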

Journal ArticleDOI
TL;DR: The irbleigs code is an implementation of an implicitly restarted block-Lanczos method for computing a few selected nearby eigenvalues and associated eigenvectors of a large, possibly sparse, Hermitian matrix A, which makes it well suited for large-scale problems.
Abstract: The irbleigs code is an implementation of an implicitly restarted block-Lanczos method for computing a few selected nearby eigenvalues and associated eigenvectors of a large, possibly sparse, Hermitian matrix A. The code requires only the evaluation of matrix-vector products with A; in particular, factorization of A is not demanded, nor is the solution of linear systems of equations with the matrix A. This, together with a fairly small storage requirement, makes the irbleigs code well suited for large-scale problems. Applications of the irbleigs code to certain generalized eigenvalue problems and to the computation of a few singular values and associated singular vectors are also discussed. Numerous computed examples illustrate the performance of the method and provide comparisons with other available codes.
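
A bare-bones sketch of the Lanczos principle underneath (numpy; irbleigs adds blocking, implicit restarts, and careful reorthogonalization on top of this): eigenvalue estimates come from a small tridiagonal projection built from matrix-vector products alone.

```python
import numpy as np

def lanczos_extreme_eigs(matvec, n, m=40, seed=0):
    """m steps of plain Lanczos; the eigenvalues (Ritz values) of the
    small tridiagonal T approximate extreme eigenvalues of the
    Hermitian operator behind matvec -- no factorization of A needed."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    Q = [q / np.linalg.norm(q)]
    alpha, beta = [], []
    for k in range(m):
        w = matvec(Q[-1])
        a = Q[-1] @ w
        w = w - a * Q[-1] - (beta[-1] * Q[-2] if k > 0 else 0)
        w = w - Q[-1] * (Q[-1] @ w)   # one cheap reorthogonalization pass
        b = np.linalg.norm(w)
        alpha.append(a); beta.append(b)
        if b < 1e-12:
            break
        Q.append(w / b)
    T = (np.diag(alpha) + np.diag(beta[:len(alpha) - 1], 1)
                        + np.diag(beta[:len(alpha) - 1], -1))
    return np.linalg.eigvalsh(T)

A = np.diag(np.arange(1.0, 501.0))            # spectrum 1..500
ritz = lanczos_extreme_eigs(lambda v: A @ v, 500)
print(ritz[0], ritz[-1])                      # near 1 and 500
```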

Proceedings ArticleDOI
16 Dec 2002
TL;DR: These designs significantly reduce the latency as well as the area, and improve on previous designs in terms of the area/speed metric, where the speed denotes the maximum achievable running frequency.

Abstract: We develop new algorithms and architectures for matrix multiplication on configurable hardware. These designs significantly reduce the latency as well as the area. Our designs improve on previous designs in terms of the area/speed metric, where the speed denotes the maximum achievable running frequency. The area/speed metrics for the previous designs and our design are 14.45, 4.93, and 2.35, respectively, for 4 × 4 matrix multiplication. The latency of one of the previous designs is 0.57 μs, while our design takes 0.15 μs using 18% less area. The area of our designs is smaller by 11%-46% compared with the best known systolic designs with the same latency for matrices of sizes 3 × 3 to 12 × 12. The performance improvements tend to grow with the problem size.

01 Jan 2002
TL;DR: In this paper, the authors present a fast matrix multiplication algorithm taken from [10] in a refined compact "analytical" form and demonstrate that it can be implemented as quite efficient computer code.
Abstract: The main purpose of this paper is to present a fast matrix multiplication algorithm taken from [10] in a refined compact "analytical" form and to demonstrate that it can be implemented as quite efficient computer code. Our improved presentation enables us to simplify substantially the analysis of the computational complexity and numerical stability of the algorithm as well as its computer implementation. The algorithm multiplies two N × N matrices using O(N^2.7760) arithmetic operations. In the case where N = 18·48^k, for a positive integer k, the total number of flops required by the algorithm is 4.893N^2.7760 − 16.165N^2, which is quite competitive with a similar estimate for the Winograd algorithm, 3.732N^2.8074 − 5N^2 flops, N = 8·2^k, the latter being the current record bound among all known practical algorithms. Moreover, we present a pseudo-code of the algorithm which demonstrates its very moderate working memory requirements, much smaller than that of the best available implementations of the Strassen and Winograd algorithms. We also reexamine an algorithm from [11] with operation count 3.682N^2.7734 − 7.303N^2, N = 8·12^k, which performs well even for medium matrix sizes, e.g., N < 2000. For matrices of medium-large size (say, 2000 ≤ N < 10000) we consider one-level algorithms and compare them with the (multilevel) Strassen and Winograd algorithms. The results of numerical tests clearly indicate that our accelerated matrix multiplication routines implementing two or three disjoint product-based algorithms are comparable in computational time with an implementation of the Winograd algorithm and clearly outperform it with respect to working space and (especially) numerical stability. The tests were performed for matrices of order up to 7000, both in double and single precision.
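
For orientation, a compact recursive Strassen multiplication with the kind of recursion cutoff all such practical comparisons rely on (numpy; power-of-two sizes only for brevity; this is the standard baseline, not the paper's algorithm):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's algorithm: 7 recursive half-size multiplies instead
    of 8; falls back to ordinary multiplication below the cutoff,
    where the classical algorithm is faster in practice."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:m, :m] = M1 + M4 - M5 + M7
    C[:m, m:] = M3 + M5
    C[m:, :m] = M2 + M4
    C[m:, m:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
print(np.allclose(strassen(A, B), A @ B))  # True, up to rounding
```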

Journal ArticleDOI
TL;DR: Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm; the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high-performance code is also exposed.
Abstract: Despite extensive research, optimal performance has not easily been available previously for matrix multiplication (especially for large matrices) on most architectures because of the lack of a structured approach and the limitations imposed by matrix storage formats. A simple but effective framework is presented here that lays the foundation for building high-performance matrix-multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three different representative RISC and CISC architectures on which they significantly outperform highly optimized libraries such as ATLAS and other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently generalizes the applicability of the memory hierarchy friendly Morton ordering to arbitrary-sized matrices. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance for a range of problem sizes. Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm. The authors expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high-performance code. These kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be effectively delivered to new and existing applications through the use of object-oriented or compiler-based approaches. Copyright © 2002 John Wiley & Sons, Ltd.

Patent
06 Jun 2002
TL;DR: A group of instructions, block4 and block4v, is described for a matrix processor (16) that rearranges data between vector and matrix forms of an A×B matrix of data (120), where the data matrix includes one or more 4×4 sub-matrices of data (160-166).
Abstract: This invention discloses a group of instructions, block4 and block4v, in a matrix processor (16) that rearranges data between vector and matrix forms of an A×B matrix of data (120), where the data matrix includes one or more 4×4 sub-matrices of data (160-166). The instructions simultaneously swap rows or columns between the first (140), second (142), third (144), and fourth (146) matrix registers, performing predefined matrix tensor operations on the data matrix from one of the following groups of operations: swapping rows between the individual matrix registers, or swapping columns between the individual matrix registers. Additionally, successive iterations or combinations of the block4 and/or block4v instructions perform standard tensor matrix operations from the following group: transpose, shuffle, and deal.

Book ChapterDOI
03 Jul 2002
TL;DR: An algorithm is given to solve the minimum cycle basis problem for regular matroids, based upon Seymour's decomposition theorem; the Gomory-Hu tree, which is essentially the solution for cographic matroids; and the corresponding result for graphs.
Abstract: An algorithm is given to solve the minimum cycle basis problem for regular matroids. The result is based upon Seymour's decomposition theorem for regular matroids; the Gomory-Hu tree, which is essentially the solution for cographic matroids; and the corresponding result for graphs. The complexity of the algorithm is O((n + m)^4), provided that a regular matroid is represented as a binary n × m matrix. The complexity decreases to O((n + m)^3.376) using fast matrix multiplication.

01 Jan 2002
TL;DR: In this paper, the perturbation theory for the eigenvalue problem of a formal matrix product A_1^(s_1) ··· A_p^(s_p), where all A_k are square and s_k ∈ {−1, 1}, is studied.
Abstract: We study the perturbation theory for the eigenvalue problem of a formal matrix product A_1^(s_1) ··· A_p^(s_p), where all A_k are square and s_k ∈ {−1, 1}. We generalize the classical perturbation results for matrices and matrix pencils to perturbation results for generalized deflating subspaces and eigenvalues of such formal matrix products. As an application we then extend the structured perturbation theory for the eigenvalue problem of Hamiltonian matrices to Hamiltonian/skew-Hamiltonian pencils. AMS subject classification: 65F15, 93B40, 93B60, 65H17.

BookDOI
01 Jan 2002
TL;DR: In this paper, the authors present a model of planar and spatial rigid-body systems with a general universal joint and a set of constraints, including the shortest distance between two rotation axes.
Abstract: Table of contents:
1. Introduction
2. Planar and spatial vectors, matrices, and vector functions
3. Constraint equations and constraint reaction forces of mechanisms
4. Dynamics of planar and spatial rigid-body systems
5. Model equations of planar and spatial joints
6. Constitutive relations of planar and spatial external forces and torques
A. Appendix
A.1 Special vector and matrix operations used in mechanics
A.1.1 Euclidean vector space
A.1.2 Scalar product and cross product of planar vectors
A.1.3 Cross product of spatial vectors
A.1.4 Time derivatives of planar orientation matrices and of planar vectors in different frames
A.1.5 Time derivatives of spatial orientation matrices and of spatial vectors in different frames
A.1.6 Derivatives of vector functions
A.2.1 Kinetic energy of an unconstrained rigid body
A.2.3 Spatial equations of motion of a constrained rigid body
A.4 Constraint equations of a general universal joint
A.4.1 Notation and abbreviations
A.4.2 Computation of constraint equations
A.4.2.1 First constraint equation
A.4.2.2 Second constraint equation
A.4.2.3 Third constraint equation
A.4.2.4 Fourth constraint equation
A.4.3 Computation of the shortest distance between two rotation axes
References
List of figures

Patent
03 Sep 2002
TL;DR: In this article, an integrated VMM (vector-matrix multiplier) module, including an electro-optical VMM component that multiplies an input vector by a matrix to produce an output vector, and an electronic VPU (vector processing unit) that processes at least one of the input and output vectors are discussed.
Abstract: An integrated VMM (vector-matrix multiplier) module, including an electro-optical VMM component that multiplies an input vector by a matrix to produce an output vector; and an electronic VPU (vector processing unit) that processes at least one of the input and output vectors. Various error reducing mechanisms are also discussed.

Patent
04 Jan 2002
TL;DR: Matrices to be used for the random orthogonal transformation of blocks of data in a transmission chain are generated by dividing a square matrix with orthogonal rows and columns into M matrices, each having M*n rows and n columns, where M is an integer larger than one.
Abstract: Matrices to be used for the random orthogonal transformation of blocks of data in a transmission chain are generated. A square matrix with orthogonal column vectors and orthogonal row vectors is divided to create M matrices. The number of rows of each of these matrices is equal to M*n, where n is the number of columns of each of the matrices and M is an integer larger than one. Each of the M matrices is allocated to a transmitter in a transmission chain or, alternatively, a plurality of the M matrices are allocated to one base station of a wireless transmission system.

Journal ArticleDOI
TL;DR: In this article, the perturbation theory for the eigenvalue problem of a formal matrix product A_1^(s_1) ··· A_p^(s_p), where all A_k are square and s_k ∈ {−1, 1}, is studied.
Abstract: We study the perturbation theory for the eigenvalue problem of a formal matrix product A_1^(s_1) ··· A_p^(s_p), where all A_k are square and s_k ∈ {−1, 1}. We generalize the classical perturbation results for matrices and matrix pencils to perturbation results for generalized deflating subspaces and eigenvalues of such formal matrix products. As an application we then extend the structured perturbation theory for the eigenvalue problem of Hamiltonian matrices to Hamiltonian/skew-Hamiltonian pencils.

Journal ArticleDOI
TL;DR: This paper proposes an efficient methodology to evaluate the reliability of large and complex systems based on minimal path sets and presents an improved multi-variable inversion (MVI) algorithm to evaluate system reliability in a compact form.
Abstract: Reliability evaluation of a large and complex system is quite an involved and time-consuming process, and its state of the art is far from satisfactory. This is mainly due to the fact that unionizing path sets results in a large number of terms in the reliability expression. Thereafter, the process of computing the numerical value of system reliability from its expression is not free from the build-up of round-off errors. The demands of the entire process also rule out the use of a low-end PC for computing the system reliability of such systems. In this paper, we propose an efficient methodology to evaluate the reliability of large and complex systems based on minimal path sets; the path sets enumeration procedure used in this paper generates path sets in lexicographic and increasing order of cardinality, a condition which is helpful in obtaining the sum of disjoint products (SDP) form of the system reliability expression in a compact manner. Although we make use of the system connection matrix, no complicated matrix operations are performed to obtain the results. The paper further presents an improved multi-variable inversion (MVI) algorithm to evaluate system reliability in a compact form. Our approach offers an extensive reduction in the number of mutually disjoint terms and provides a minimized and compact system reliability expression. The procedure not only results in substantial savings of CPU time but can also be run on a low-end PC. To demonstrate this capability, we solve several problems of varied complexities on a low-end PC and also provide a comparison of our approach with earlier techniques available for the purpose.
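
A tiny illustration of the computation being optimized: brute-force inclusion-exclusion over minimal path sets, whose exponential growth in the number of path sets is precisely what the SDP/MVI approach avoids (Python; the toy network and probabilities are made up):

```python
from itertools import combinations

def reliability(path_sets, p):
    """System reliability from minimal path sets via inclusion-exclusion:
    P(at least one path has all its components working), components
    failing independently. Cost grows as 2^len(path_sets)."""
    total = 0.0
    for r in range(1, len(path_sets) + 1):
        for subset in combinations(path_sets, r):
            comps = set().union(*subset)
            term = 1.0
            for c in comps:
                term *= p[c]
            total += (-1) ** (r + 1) * term
    return total

# Two parallel series paths: {a, b} and {c, d}.
p = {'a': 0.9, 'b': 0.9, 'c': 0.8, 'd': 0.8}
print(reliability([{'a', 'b'}, {'c', 'd'}], p))
# 0.81 + 0.64 - 0.81 * 0.64 = 0.9316
```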

Book ChapterDOI
02 Sep 2002
TL;DR: These designs significantly reduce the energy dissipation and latency compared with the state-of-the-art FPGA-based designs and improve the energy performance of the optimized design from the recent Xilinx library by 32% to 88% without any increase in area-latency product.
Abstract: We develop new algorithms and architectures for matrix multiplication on configurable devices. These designs significantly reduce the energy dissipation and latency compared with the state-of-the-art FPGA-based designs. We derive functions to represent the impact of algorithmic level design choices on the system-wide energy dissipation, latency, and area by capturing algorithm and architecture details including features of the target FPGA. The functions are used to optimize energy performance under latency and area constraints for a family of candidate algorithms and architectures. As a result, our designs improve the energy performance of the optimized design from the recent Xilinx library by 32% to 88% without any increase in area-latency product. In terms of comprehensive metrics such as EAT (Energy-Area-Time) and E/AT (Energy/Area-Time), our designs offer superior performance compared with the Xilinx design by 50%-79% and 13%-44%, respectively. We also address how to exploit further increases in density of future FPGA devices for asymptotic improvement in latency and energy dissipation for multiplication of larger size matrices.

Proceedings ArticleDOI
16 Dec 2002
TL;DR: FPGAs can multiply two n × n matrices with both lower latency and lower energy consumption than the other two types of devices, which makes FPGAs the ideal choice for matrix multiplication in signal processing applications.
Abstract: Advances in their underlying technologies have positioned FPGAs and embedded processors to compete with digital signal processors (DSPs). In this paper, we evaluate the performance, in terms of both latency and energy-efficiency, of FPGAs, embedded processors, and DSPs in multiplying two n × n matrices. As specific examples, we have chosen a representative of each type of device. Our results show that FPGAs can multiply two n × n matrices with both lower latency and lower energy consumption than the other two types of devices. This makes FPGAs the ideal choice for matrix multiplication in signal processing applications.

Patent
04 Sep 2002
TL;DR: In this paper, a functional unit that computes the product of a matrix operand with a vector operand, producing a vector result is proposed, which can fully utilize the entire resources of a 128b by 128b multiplier regardless of the operand size.
Abstract: The present invention provides a system and method for improving the performance of general-purpose processors by implementing a functional unit that computes the product of a matrix operand with a vector operand, producing a vector result. The functional unit fully utilizes the entire resources of a 128b by 128b multiplier regardless of the operand size, as the number of elements of the matrix and vector operands increase as operand size is reduced. The unit performs both fixed-point and floating-point multiplications and additions with the highest-possible intermediate accuracy with modest resources.

Journal ArticleDOI
TL;DR: In this paper, the Khatri-Rao and Tracy-Singh products for partitioned matrices are viewed as generalized Hadamard and generalized Kronecker products, respectively.
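
In the most common (column-partitioned) special case, the Khatri-Rao product reduces to the column-wise Kronecker product; a quick numerical check (scipy.linalg.khatri_rao implements this case, while the paper treats general block partitions):

```python
import numpy as np
from scipy.linalg import khatri_rao

A = np.arange(6.0).reshape(2, 3)
B = np.arange(9.0).reshape(3, 3)

# Column j of the result is kron(A[:, j], B[:, j]).
KR = khatri_rao(A, B)
manual = np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(3)])
print(np.allclose(KR, manual), KR.shape)  # True (6, 3)
```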

Journal ArticleDOI
TL;DR: In this paper, an optimal set of vectors with a specified inner product structure is constructed from a given set of vectors in a complex Hilbert space, and the optimal vectors are chosen to minimize the sum of the squared norms of the errors between the constructed vectors and the given vectors.
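
In the special case where the specified structure is orthonormality, the least-squares-optimal construction is the classical symmetric (Löwdin) orthogonalization, computable from an SVD; a sketch of that instance only (numpy; the paper handles general inner product structures):

```python
import numpy as np

def nearest_orthonormal(X):
    """Orthonormal columns minimizing the sum of squared distances to
    the columns of X: the polar factor U @ Vt from the SVD of X."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
Q = nearest_orthonormal(X)
print(np.allclose(Q.T @ Q, np.eye(3)))  # orthonormal columns
print(np.linalg.norm(Q - X))            # minimal over all such Q
```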