Proceedings ArticleDOI

Scalable task-based algorithm for multiplication of block-rank-sparse matrices

15 Nov 2015 - pp 4
TL;DR: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication, is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC).
Abstract: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).
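As a concrete reference for the underlying algorithm, here is a minimal serial NumPy sketch of SUMMA's outer-product iterations. This is our illustration, not the paper's TiledArray implementation: all names are ours, and the paper's contribution is precisely to turn the per-block multiplications below into fine-grained tasks with several iterations k in flight concurrently.

```python
import numpy as np

def summa(A, B, p):
    """Serial sketch of SUMMA on a p x p process grid.

    A and B are square, with order divisible by p. Iteration k
    broadcasts block-column k of A and block-row k of B; each grid
    "process" (i, j) then accumulates one outer-product contribution
    C[i, j] += A[i, k] @ B[k, j].
    """
    n = A.shape[0]
    b = n // p                           # block size owned by each process

    def s(i):                            # index range of block row/column i
        return slice(i * b, (i + 1) * b)

    C = np.zeros((n, n))
    for k in range(p):                   # the p SUMMA iterations
        for i in range(p):
            for j in range(p):
                # in the block-rank-sparse setting, this task would simply
                # not be spawned when either input block is zero/negligible
                C[s(i), s(j)] += A[s(i), s(k)] @ B[s(k), s(j)]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
assert np.allclose(summa(A, B, p=4), A @ B)
```
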
Citations
Posted Content
TL;DR: Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations, is introduced and used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.
Abstract: Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow the user can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-TensorFlow is available at this https URL.
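To make the dimension-splitting concrete, here is a minimal serial NumPy sketch (our illustration, not the Mesh-TensorFlow API): the contraction dimension of a matmul is split across a hypothetical 1-D mesh, each shard computes a partial product, and summing the shards stands in for the Allreduce in the compiled SPMD program.

```python
import numpy as np

# Emulate splitting the contraction dimension of Y = X @ W across a
# 1-D "mesh" of 4 processors (names and sizes are illustrative).
mesh_size = 4
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 16))    # activations, split along dim 1
W = rng.standard_normal((16, 8))    # weights, split along dim 0

X_shards = np.split(X, mesh_size, axis=1)
W_shards = np.split(W, mesh_size, axis=0)

partials = [x @ w for x, w in zip(X_shards, W_shards)]  # per-processor work
Y = sum(partials)                                       # plays the Allreduce(sum) role

assert np.allclose(Y, X @ W)
```
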

121 citations


Cites methods from "Scalable task-based algorithm for m..."

  • ...This technique is sometimes called iteration space tiling [2], replication [6], or task parallelism [11]....

Journal ArticleDOI
TL;DR: The theory and computer program for analytical nuclear energy gradients for (extended) multistate complete active space perturbation theory (CASPT2) with full internal contraction are reported, extending the fully internally contracted CASPT2 nuclear gradient program recently developed for a state-specific variant.
Abstract: We report the development of the theory and computer program for analytical nuclear energy gradients for (extended) multistate complete active space perturbation theory (CASPT2) with full internal contraction. The vertical shifts are also considered in this work. This is an extension of the fully internally contracted CASPT2 nuclear gradient program recently developed for a state-specific variant by us [MacLeod and Shiozaki, J. Chem. Phys. 2015, 142, 051103]; in this extension, the so-called λ equation is solved to account for the variation of the multistate CASPT2 energies with respect to the change in the amplitudes obtained in the preceding state-specific CASPT2 calculations, and the Z-vector equations are modified accordingly. The program is parallelized using the MPI-3 remote memory access protocol, which allows us to perform efficient one-sided communication. The optimized geometries of the ground and excited states of a copper corrole and benzophenone are presented as numerical examples. The code is p...
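The MPI-3 remote memory access mentioned above is one-sided communication: a rank reads or writes another rank's exposed memory without the target posting a matching receive. Below is a minimal mpi4py sketch of that pattern; it is illustrative only, not the paper's code, and the window contents and ranks are made up.

```python
# Minimal sketch of the MPI-3 one-sided (RMA) pattern with mpi4py.
# Run under MPI, e.g.: mpiexec -n 2 python rma_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(4, rank, dtype='d')   # memory this rank exposes to others
win = MPI.Win.Create(local, comm=comm)

target = (rank + 1) % comm.Get_size()
buf = np.empty(4, dtype='d')

win.Lock(target, MPI.LOCK_SHARED)     # passive-target epoch: the target
win.Get(buf, target)                  # rank never posts a matching recv
win.Unlock(target)

assert np.all(buf == target)          # we read the target's rank values
win.Free()
```
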

95 citations

Journal ArticleDOI
TL;DR: The features and capabilities of MADNESS are described and some current applications in chemistry and several areas of physics are discussed.
Abstract: MADNESS (multiresolution adaptive numerical environment for scientific simulation) is a high-level software environment for solving integral and differential equations in many dimensions that uses adaptive and fast harmonic analysis methods with guaranteed precision based on multiresolution analysis and separated representations. Underpinning the numerical capabilities is a powerful petascale parallel programming environment that aims to increase both programmer productivity and code scalability. This paper describes the features and capabilities of MADNESS and briefly discusses some current applications in chemistry and several areas of physics.
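The MADNESS runtime itself is C++; purely as an illustration of the futures-based, task-composed style its abstract describes, here is a small Python sketch (structure and names are ours, not MADNESS's API): independent tasks return futures, and a dependent task consumes them, so the scheduler is free to interleave work and hide latency.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(2)
blocks = [rng.standard_normal((64, 64)) for _ in range(4)]

with ThreadPoolExecutor() as pool:
    # one independent task per block; the pool runs them in any order
    squared = [pool.submit(lambda blk: blk @ blk, b) for b in blocks]
    # a dependent task blocks only on the futures it actually needs
    total = pool.submit(lambda fs: sum(f.result() for f in fs), squared)
    print(np.linalg.norm(total.result()))
```
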

77 citations


Cites methods from "Scalable task-based algorithm for m..."

  • ...These abilities, as provided by the MADNESS runtime, are also used by TiledArray [18] (a framework for block-sparse tensor computations) to hide communication costs and withstand load imbalances in handling block-sparse data....

  • ...In a similar fashion, the MADNESS parallel runtime is being successfully used for petascale computations independent of the numerical layer [18, 19], illustrating the power and utility of the massively threaded, task-based approach to computation....

  • ...These abilities, as provided by the MADNESS runtime, are also used by TiledArray [19, 18] (a framework for block-sparse tensor computations) to hide communication costs and withstand load imbalances in handling block-sparse data....

Journal ArticleDOI
TL;DR: Two efficient and intruder-free methods are presented for treating dynamic correlation on top of general multiconfiguration reference wave functions, including those obtained by the density matrix renormalization group (DMRG) with large active spaces.
Abstract: We present two efficient and intruder-free methods for treating dynamic correlation on top of general multiconfiguration reference wave functions, including those obtained by the density matrix renormalization group (DMRG) with large active spaces. The new methods are the second-order variant of the recently proposed multireference linearized coupled cluster method (MRLCC) [Sharma, S.; Alavi, A. J. Chem. Phys. 2015, 143, 102815] and of N-electron valence perturbation theory (NEVPT2), with expected accuracies similar to MRCI+Q and (at least) CASPT2, respectively. Great efficiency gains are realized by representing the first-order wave function with a combination of internal contraction (IC) and matrix product state perturbation theory (MPSPT). With this combination, only third-order reduced density matrices (RDMs) are required. Thus, we obviate the need for calculating (or estimating) RDMs of fourth or higher order; these had so far posed a severe bottleneck for dynamic correlation treatments involving ...

65 citations

Journal ArticleDOI
TL;DR: TBLIS implements tensor contraction using the flexible BLAS-like Library Instantiation Software (BLIS) framework, which allows transposition (reshaping) of the tensor to be fused with internal partitioning and packing operations, requiring no explicit transposition operations or additional workspace.
Abstract: Tensor computations, in particular tensor contraction (TC), are important kernels in many scientific computing applications. Due to the fundamental similarity of TC to matrix multiplication and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the flexible BLAS-like Library Instantiation Software (BLIS) framework, which allows transposition (reshaping) of the tensor to be fused with internal partitioning and packing operations, requiring no explicit transposition operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of matrix multiplication, and in some cases considerably higher than that of traditional TC. Our implementation supports multithreading using an approach identical to that used for matrix multiplication in BLIS, with similar performance characteristics. The complexity...
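For contrast, the traditional BLAS route that TBLIS avoids is often called TTGT (transpose-transpose-GEMM-transpose): permute and reshape the tensors so the contraction becomes a single matrix multiplication, paying for an explicit transposed copy. A small NumPy sketch with a made-up example contraction (not TBLIS code):

```python
import numpy as np

# TTGT for the contraction C[a,b,c] = sum_k A[a,k,b] * B[k,c]:
# reshape it into one matrix multiplication, at the cost of an
# explicit transposition and a temporary copy of A.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5, 4))   # indices a, k, b
B = rng.standard_normal((5, 6))      # indices k, c

a, k, b = A.shape
_, c = B.shape

A_mat = A.transpose(0, 2, 1).reshape(a * b, k)   # explicit transpose + copy
C = (A_mat @ B).reshape(a, b, c)                 # single GEMM, then reshape

assert np.allclose(C, np.einsum('akb,kc->abc', A, B))
```
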

55 citations

References
Book
01 Jan 1982
TL;DR: A graduate-level text explaining modern in-depth approaches to the calculation of the electronic structure and properties of molecules: the Hartree-Fock approximation, the electron pair approximation, and much more; largely self-contained, with a solid course in physical chemistry as the only prerequisite, and over 150 exercises (1989 edition).
Abstract: Graduate-level text explaining modern in-depth approaches to the calculation of the electronic structure and properties of molecules: the Hartree-Fock approximation, the electron pair approximation, and much more. Largely self-contained; the only prerequisite is a solid course in physical chemistry. Over 150 exercises. 1989 edition.

3,110 citations

Journal ArticleDOI
TL;DR: An algorithm is given that computes the product of two square matrices A and B of order n in fewer than 4.7·n^(log 7) arithmetical operations (all logarithms to base 2, so log 7 ≈ 2.8), compared with approximately 2n^3 for the usual method.
Abstract: 1. Below we will give an algorithm which computes the coefficients of the product of two square matrices A and B of order n from the coefficients of A and B with less than 4.7·n^(log 7) arithmetical operations (all logarithms in this paper are for base 2, thus log 7 ≈ 2.8; the usual method requires approximately 2n^3 arithmetical operations). The algorithm induces algorithms for inverting a matrix of order n, solving a system of n linear equations in n unknowns, computing a determinant of order n, etc., all requiring less than const·n^(log 7) arithmetical operations. This fact should be compared with the result of Klyuyev and Kokovkin-Shcherbak [1] that Gaussian elimination for solving a system of linear equations is optimal if one restricts oneself to operations upon rows and columns as a whole. We also note that Winograd [2] modifies the usual algorithms for matrix multiplication and inversion and for solving systems of linear equations, trading roughly half of the multiplications for additions and subtractions. It is a pleasure to thank D. Brillinger for inspiring discussions about the present subject and St. Cook and B. Parlett for encouraging me to write this paper. 2. We define algorithms α(m, k) which multiply matrices of order m·2^k, by induction on k: α(m, 0) is the usual algorithm for matrix multiplication (requiring m^3 multiplications and m^2(m − 1) additions); α(m, k) already being known, define α(m, k+1) as follows: if A, B are matrices of order m·2^(k+1) to be multiplied, write ...
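For reference, here is a compact NumPy rendering of the recursion the abstract describes (a minimal sketch assuming power-of-two matrix order; the leaf cutoff is our illustrative choice, not part of the original paper):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's recursion for square matrices of power-of-two order.

    Seven half-size products replace the usual eight, giving the
    O(n^(log2 7)) ≈ O(n^2.81) operation count from the abstract.
    """
    n = A.shape[0]
    if n <= leaf:                        # fall back to the classical product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(4)
A = rng.standard_normal((128, 128))
B = rng.standard_normal((128, 128))
assert np.allclose(strassen(A, B), A @ B)
```
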

2,581 citations


"Scalable task-based algorithm for m..." refers background in this paper

  • ...A frontier challenge posed by scientific and engineering applications in areas as distinct as quantum physics and machine learning is dealing with sparse and non-standard tensorial data representations....

Book
01 Jan 1987
TL;DR: The ScaLAPACK Users' Guide, describing a library of high-performance linear algebra routines for distributed-memory message-passing computers.
Abstract: The ScaLAPACK Users' Guide documents ScaLAPACK, a library of high-performance dense linear algebra routines for distributed-memory message-passing computers. ScaLAPACK extends the dense and banded matrix routines of LAPACK to distributed memory, building on the PBLAS (parallel BLAS) and the BLACS (Basic Linear Algebra Communication Subprograms).

945 citations

Journal ArticleDOI
TL;DR: An efficient production-level implementation of the closed-shell CEPA and CPF methods is reported that can be applied to medium-sized molecules and has essentially the same accuracy as the parent CEPA (CPF) methods for thermochemistry, kinetics, weak interactions, and potential energy surfaces, but is up to 500 times faster.
Abstract: Coupled-electron pair approximations (CEPAs) and coupled-pair functionals (CPFs) were popular in the 1970s and 1980s and have yielded excellent results for small molecules. Recently, interest in CEPA and CPF methods has been renewed. It has been shown that these methods lead to competitive thermochemical, kinetic, and structural predictions. They greatly surpass second-order Møller–Plesset and popular density functional theory based approaches in accuracy and are intermediate in quality between CCSD and CCSD(T) in extended benchmark studies. In this work an efficient production-level implementation of the closed-shell CEPA and CPF methods is reported that can be applied to medium-sized molecules in the range of 50–100 atoms and up to about 2000 basis functions. The internal space is spanned by localized internal orbitals. The external space is greatly compressed through the method of pair natural orbitals (PNOs) that was also introduced by the pioneers of the CEPA approaches. Our implementation also ...

497 citations


"Scalable task-based algorithm for m..." refers background in this paper

  • ...Keywords distributed memory, matrix multiplication, SUMMA, lowrank decomposition, irregular computation, rank-structured, matrix, H matrix, semiseparable matrix, task parallelism, tensor contraction...
