Proceedings ArticleDOI

Scalable task-based algorithm for multiplication of block-rank-sparse matrices

15 Nov 2015 - pp 4
TL;DR: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication, is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC).
Abstract: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).
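As a concrete reference for the underlying algorithm, here is a minimal serial NumPy sketch of SUMMA's outer-product iterations. This is our illustration, not the paper's TiledArray implementation: all names are ours, and the paper's contribution is precisely to turn the per-block multiplications below into fine-grained tasks with several iterations k in flight concurrently.

```python
import numpy as np

def summa(A, B, p):
    """Serial sketch of SUMMA on a p x p process grid.

    A and B are square, with order divisible by p. Iteration k
    broadcasts block-column k of A and block-row k of B; each grid
    "process" (i, j) then accumulates one outer-product contribution
    C[i, j] += A[i, k] @ B[k, j].
    """
    n = A.shape[0]
    b = n // p                           # block size owned by each process

    def s(i):                            # index range of block row/column i
        return slice(i * b, (i + 1) * b)

    C = np.zeros((n, n))
    for k in range(p):                   # the p SUMMA iterations
        for i in range(p):
            for j in range(p):
                # in the block-rank-sparse setting, this task would simply
                # not be spawned when either input block is zero/negligible
                C[s(i), s(j)] += A[s(i), s(k)] @ B[s(k), s(j)]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
assert np.allclose(summa(A, B, p=4), A @ B)
```
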
Citations
Posted Content
TL;DR: Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations, is introduced and used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.
Abstract: Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow the user can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-TensorFlow is available at this https URL.
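To make the dimension-splitting concrete, here is a minimal serial NumPy sketch (our illustration, not the Mesh-TensorFlow API): the contraction dimension of a matmul is split across a hypothetical 1-D mesh, each shard computes a partial product, and summing the shards stands in for the Allreduce in the compiled SPMD program.

```python
import numpy as np

# Emulate splitting the contraction dimension of Y = X @ W across a
# 1-D "mesh" of 4 processors (names and sizes are illustrative).
mesh_size = 4
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 16))    # activations, split along dim 1
W = rng.standard_normal((16, 8))    # weights, split along dim 0

X_shards = np.split(X, mesh_size, axis=1)
W_shards = np.split(W, mesh_size, axis=0)

partials = [x @ w for x, w in zip(X_shards, W_shards)]  # per-processor work
Y = sum(partials)                                       # plays the Allreduce(sum) role

assert np.allclose(Y, X @ W)
```
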

121 citations


Cites methods from "Scalable task-based algorithm for m..."

  • ...This technique is sometimes called iteration space tiling [2], replication [6], or task parallelism [11]....

Journal ArticleDOI
TL;DR: The theory and computer program for analytical nuclear energy gradients for (extended) multistate complete active space perturbation theory (CASPT2) with full internal contraction are reported, extending the fully internally contracted CASPT2 nuclear gradient program recently developed for a state-specific variant.
Abstract: We report the development of the theory and computer program for analytical nuclear energy gradients for (extended) multistate complete active space perturbation theory (CASPT2) with full internal contraction. The vertical shifts are also considered in this work. This is an extension of the fully internally contracted CASPT2 nuclear gradient program recently developed for a state-specific variant by us [MacLeod and Shiozaki, J. Chem. Phys. 2015, 142, 051103]; in this extension, the so-called λ equation is solved to account for the variation of the multistate CASPT2 energies with respect to the change in the amplitudes obtained in the preceding state-specific CASPT2 calculations, and the Z-vector equations are modified accordingly. The program is parallelized using the MPI-3 remote memory access protocol, which allows us to perform efficient one-sided communication. The optimized geometries of the ground and excited states of a copper corrole and benzophenone are presented as numerical examples. The code is p...
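The MPI-3 remote memory access mentioned above is one-sided communication: a rank reads or writes another rank's exposed memory without the target posting a matching receive. Below is a minimal mpi4py sketch of that pattern; it is illustrative only, not the paper's code, and the window contents and ranks are made up.

```python
# Minimal sketch of the MPI-3 one-sided (RMA) pattern with mpi4py.
# Run under MPI, e.g.: mpiexec -n 2 python rma_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(4, rank, dtype='d')   # memory this rank exposes to others
win = MPI.Win.Create(local, comm=comm)

target = (rank + 1) % comm.Get_size()
buf = np.empty(4, dtype='d')

win.Lock(target, MPI.LOCK_SHARED)     # passive-target epoch: the target
win.Get(buf, target)                  # rank never posts a matching recv
win.Unlock(target)

assert np.all(buf == target)          # we read the target's rank values
win.Free()
```
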

95 citations

Journal ArticleDOI
TL;DR: The features and capabilities of MADNESS are described and some current applications in chemistry and several areas of physics are discussed.
Abstract: MADNESS (multiresolution adaptive numerical environment for scientific simulation) is a high-level software environment for solving integral and differential equations in many dimensions that uses adaptive and fast harmonic analysis methods with guaranteed precision based on multiresolution analysis and separated representations. Underpinning the numerical capabilities is a powerful petascale parallel programming environment that aims to increase both programmer productivity and code scalability. This paper describes the features and capabilities of MADNESS and briefly discusses some current applications in chemistry and several areas of physics.
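The MADNESS runtime itself is C++; purely as an illustration of the futures-based, task-composed style its abstract describes, here is a small Python sketch (structure and names are ours, not MADNESS's API): independent tasks return futures, and a dependent task consumes them, so the scheduler is free to interleave work and hide latency.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(2)
blocks = [rng.standard_normal((64, 64)) for _ in range(4)]

with ThreadPoolExecutor() as pool:
    # one independent task per block; the pool runs them in any order
    squared = [pool.submit(lambda blk: blk @ blk, b) for b in blocks]
    # a dependent task blocks only on the futures it actually needs
    total = pool.submit(lambda fs: sum(f.result() for f in fs), squared)
    print(np.linalg.norm(total.result()))
```
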

77 citations


Cites methods from "Scalable task-based algorithm for m..."

  • ...These abilities, as provided by the MADNESS runtime, are also used by TiledArray [18] (a framework for block-sparse tensor computations) to hide communication costs and withstand load imbalances in handling block-sparse data....

  • ...In a similar fashion, the MADNESS parallel runtime is being successfully used for petascale computations independent of the numerical layer [18, 19], illustrating the power and utility of the massively threaded, task-based approach to computation....

  • ...These abilities, as provided by the MADNESS runtime, are also used by TiledArray [19, 18] (a framework for block-sparse tensor computations) to hide communication costs and withstand load imbalances in handling block-sparse data....

Journal ArticleDOI
TL;DR: Two efficient and intruder-free methods are presented for treating dynamic correlation on top of general multiconfiguration reference wave functions, including those obtained by the density matrix renormalization group (DMRG) with large active spaces.
Abstract: We present two efficient and intruder-free methods for treating dynamic correlation on top of general multiconfiguration reference wave functions, including those obtained by the density matrix renormalization group (DMRG) with large active spaces. The new methods are the second-order variant of the recently proposed multireference linearized coupled cluster method (MRLCC) [Sharma, S.; Alavi, A. J. Chem. Phys. 2015, 143, 102815] and of N-electron valence perturbation theory (NEVPT2), with expected accuracies similar to MRCI+Q and (at least) CASPT2, respectively. Great efficiency gains are realized by representing the first-order wave function with a combination of internal contraction (IC) and matrix product state perturbation theory (MPSPT). With this combination, only third-order reduced density matrices (RDMs) are required. Thus, we obviate the need for calculating (or estimating) RDMs of fourth or higher order; these had so far posed a severe bottleneck for dynamic correlation treatments involving ...

65 citations

Journal ArticleDOI
TL;DR: TBLIS implements tensor contraction using the flexible BLAS-like Library Instantiation Software (BLIS) framework, which allows transposition (reshaping) of the tensor to be fused with internal partitioning and packing operations, requiring no explicit transposition operations or additional workspace.
Abstract: Tensor computations, in particular tensor contraction (TC), are important kernels in many scientific computing applications. Due to the fundamental similarity of TC to matrix multiplication and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the flexible BLAS-like Library Instantiation Software (BLIS) framework, which allows transposition (reshaping) of the tensor to be fused with internal partitioning and packing operations, requiring no explicit transposition operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of matrix multiplication, and in some cases considerably higher than that of traditional TC. Our implementation supports multithreading using an approach identical to that used for matrix multiplication in BLIS, with similar performance characteristics. The complexity...
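For contrast, the traditional BLAS route that TBLIS avoids is often called TTGT (transpose-transpose-GEMM-transpose): permute and reshape the tensors so the contraction becomes a single matrix multiplication, paying for an explicit transposed copy. A small NumPy sketch with a made-up example contraction (not TBLIS code):

```python
import numpy as np

# TTGT for the contraction C[a,b,c] = sum_k A[a,k,b] * B[k,c]:
# reshape it into one matrix multiplication, at the cost of an
# explicit transposition and a temporary copy of A.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5, 4))   # indices a, k, b
B = rng.standard_normal((5, 6))      # indices k, c

a, k, b = A.shape
_, c = B.shape

A_mat = A.transpose(0, 2, 1).reshape(a * b, k)   # explicit transpose + copy
C = (A_mat @ B).reshape(a, b, c)                 # single GEMM, then reshape

assert np.allclose(C, np.einsum('akb,kc->abc', A, B))
```
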

55 citations

References
Book
01 Jan 1982
TL;DR: A graduate-level text explaining modern in-depth approaches to the calculation of the electronic structure and properties of molecules: the Hartree-Fock approximation, the electron pair approximation, and much more; largely self-contained, with a solid course in physical chemistry as the only prerequisite, and over 150 exercises (1989 edition).
Abstract: Graduate-level text explaining modern in-depth approaches to the calculation of the electronic structure and properties of molecules: the Hartree-Fock approximation, the electron pair approximation, and much more. Largely self-contained; the only prerequisite is a solid course in physical chemistry. Over 150 exercises. 1989 edition.

3,110 citations

Journal ArticleDOI
TL;DR: An algorithm is given that computes the product of two square matrices A and B of order n in fewer than 4.7·n^(log 7) arithmetical operations (all logarithms to base 2, so log 7 ≈ 2.8), compared with approximately 2n^3 for the usual method.
Abstract: 1. Below we will give an algorithm which computes the coefficients of the product of two square matrices A and B of order n from the coefficients of A and B with less than 4.7·n^(log 7) arithmetical operations (all logarithms in this paper are for base 2, thus log 7 ≈ 2.8; the usual method requires approximately 2n^3 arithmetical operations). The algorithm induces algorithms for inverting a matrix of order n, solving a system of n linear equations in n unknowns, computing a determinant of order n, etc., all requiring less than const·n^(log 7) arithmetical operations. This fact should be compared with the result of Klyuyev and Kokovkin-Shcherbak [1] that Gaussian elimination for solving a system of linear equations is optimal if one restricts oneself to operations upon rows and columns as a whole. We also note that Winograd [2] modifies the usual algorithms for matrix multiplication and inversion and for solving systems of linear equations, trading roughly half of the multiplications for additions and subtractions. It is a pleasure to thank D. Brillinger for inspiring discussions about the present subject and St. Cook and B. Parlett for encouraging me to write this paper. 2. We define algorithms α(m, k) which multiply matrices of order m·2^k, by induction on k: α(m, 0) is the usual algorithm for matrix multiplication (requiring m^3 multiplications and m^2(m − 1) additions); α(m, k) already being known, define α(m, k+1) as follows: if A, B are matrices of order m·2^(k+1) to be multiplied, write ...
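For reference, here is a compact NumPy rendering of the recursion the abstract describes (a minimal sketch assuming power-of-two matrix order; the leaf cutoff is our illustrative choice, not part of the original paper):

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Strassen's recursion for square matrices of power-of-two order.

    Seven half-size products replace the usual eight, giving the
    O(n^(log2 7)) ≈ O(n^2.81) operation count from the abstract.
    """
    n = A.shape[0]
    if n <= leaf:                        # fall back to the classical product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(4)
A = rng.standard_normal((128, 128))
B = rng.standard_normal((128, 128))
assert np.allclose(strassen(A, B), A @ B)
```
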

2,581 citations


"Scalable task-based algorithm for m..." refers background in this paper

  • ...A frontier challenge posed by scientific and engineering applications in areas as distinct as quantum physics and machine learning is dealing with sparse and non-standard tensorial data representations....

Book
01 Jan 1987
TL;DR: The ScaLAPACK Users' Guide, describing a library of high-performance linear algebra routines for distributed-memory message-passing computers.
Abstract: The ScaLAPACK Users' Guide documents ScaLAPACK, a library of high-performance dense linear algebra routines for distributed-memory message-passing computers. ScaLAPACK extends the dense and banded matrix routines of LAPACK to distributed memory, building on the PBLAS (parallel BLAS) and the BLACS (Basic Linear Algebra Communication Subprograms).

945 citations

Journal ArticleDOI
TL;DR: An efficient production-level implementation of the closed-shell CEPA and CPF methods is reported that can be applied to medium-sized molecules and has essentially the same accuracy as the parent CEPA (CPF) methods for thermochemistry, kinetics, weak interactions, and potential energy surfaces, but is up to 500 times faster.
Abstract: Coupled-electron pair approximations (CEPAs) and coupled-pair functionals (CPFs) were popular in the 1970s and 1980s and have yielded excellent results for small molecules. Recently, interest in CEPA and CPF methods has been renewed. It has been shown that these methods lead to competitive thermochemical, kinetic, and structural predictions. They greatly surpass second-order Møller–Plesset and popular density functional theory based approaches in accuracy and are intermediate in quality between CCSD and CCSD(T) in extended benchmark studies. In this work an efficient production-level implementation of the closed-shell CEPA and CPF methods is reported that can be applied to medium-sized molecules in the range of 50–100 atoms and up to about 2000 basis functions. The internal space is spanned by localized internal orbitals. The external space is greatly compressed through the method of pair natural orbitals (PNOs) that was also introduced by the pioneers of the CEPA approaches. Our implementation also ...

497 citations


"Scalable task-based algorithm for m..." refers background in this paper

  • ...Keywords distributed memory, matrix multiplication, SUMMA, lowrank decomposition, irregular computation, rank-structured, matrix, H matrix, semiseparable matrix, task parallelism, tensor contraction...
