Open Access Journal Article (DOI)

Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures

TL;DR: This paper develops parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures, and develops a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem.
Abstract
Sparse matrix-matrix multiplication is a key kernel with applications in several domains, such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
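The two-phase structure the abstract advocates, a symbolic pass that sizes the output followed by a numeric pass that reuses that structure, can be illustrated with a minimal row-wise SpGEMM sketch over CSR inputs using a dense accumulator. This is only an illustrative sketch, not the paper's kkSpGEMM implementation; all function and variable names below are invented for the example.

```python
def spgemm_symbolic(a_ptr, a_idx, b_ptr, b_idx, n_cols_c):
    """Phase 1: count the nonzeros of each row of C = A * B."""
    c_ptr = [0] * len(a_ptr)
    marker = [-1] * n_cols_c          # dense "seen in row i" flags, reused per row
    for i in range(len(a_ptr) - 1):
        count = 0
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j = a_idx[jj]             # A(i, j) is nonzero; scan row j of B
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                k = b_idx[kk]
                if marker[k] != i:    # first time column k appears in row i of C
                    marker[k] = i
                    count += 1
        c_ptr[i + 1] = c_ptr[i] + count
    return c_ptr

def spgemm_numeric(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, c_ptr, n_cols_c):
    """Phase 2: fill C's column indices and values, reusing the symbolic sizes."""
    nnz = c_ptr[-1]
    c_idx = [0] * nnz
    c_val = [0.0] * nnz
    acc = [0.0] * n_cols_c            # dense accumulator, reused per row
    marker = [-1] * n_cols_c
    for i in range(len(a_ptr) - 1):
        head = c_ptr[i]
        for jj in range(a_ptr[i], a_ptr[i + 1]):
            j = a_idx[jj]
            v = a_val[jj]
            for kk in range(b_ptr[j], b_ptr[j + 1]):
                k = b_idx[kk]
                if marker[k] != i:    # new column in this row: record its index
                    marker[k] = i
                    c_idx[head] = k
                    head += 1
                    acc[k] = v * b_val[kk]
                else:                 # already seen: accumulate the partial product
                    acc[k] += v * b_val[kk]
        for p in range(c_ptr[i], c_ptr[i + 1]):
            c_val[p] = acc[c_idx[p]]  # gather accumulated values into CSR order
    return c_idx, c_val
```

Because the symbolic phase depends only on sparsity patterns, its output (`c_ptr` and the allocation it implies) can be computed once and reused across repeated multiplications with the same structure, which is the reuse argument the abstract makes.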


Citations
Proceedings ArticleDOI

IA-SpGEMM: an input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication

TL;DR: IA-SpGEMM is proposed, an input-aware auto-tuning framework for SpGEMM that provides a unified programming interface in the CSR format and automatically determines the best format and algorithm for arbitrary sparse matrices.
Journal ArticleDOI

Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

TL;DR: The key idea of the SpGEMM algorithm, tSparse, is to multiply sparse rectangular blocks using the mixed-precision mode of TCUs; this is the first time TCUs are used in the context of SpGEMM.
Proceedings ArticleDOI

Adaptive sparse matrix-matrix multiplication on the GPU

TL;DR: Evaluation on an extensive sparse matrix benchmark suggests this approach is the fastest SpGEMM implementation for highly sparse matrices (80% of the set), and when bit-stable results are sought, the approach is the fastest across the entire test set.
Proceedings ArticleDOI

Fast Triangle Counting Using Cilk

TL;DR: This paper develops an SpGEMM implementation that relies on a highly efficient, work-stealing, multithreaded runtime, and presents an analysis of how the triangle counting implementation scales as graph sizes increase, using both synthetic and real graphs from the Graph Challenge data set.
Proceedings ArticleDOI

Scalable Inference for Sparse Deep Neural Networks using Kokkos Kernels

TL;DR: This work bases its sparse DNN inference implementation, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library, using the sparse matrix-matrix multiplication in Kokkos Kernels to reuse a highly optimized kernel.
References
Journal ArticleDOI

The University of Florida sparse matrix collection

TL;DR: The University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications, is described and a new multilevel coarsening scheme is proposed to facilitate this task.
Journal ArticleDOI

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns

TL;DR: Kokkos’ abstractions are described, its application programmer interface (API) is summarized, performance results for unit-test kernels and mini-applications are presented, and an incremental strategy for migrating legacy C++ codes to Kokkos is outlined.
Book ChapterDOI

Intel Math Kernel Library

TL;DR: In order to achieve optimal performance on multi-core and multi-processor systems, the library must exploit parallelism and manage the memory hierarchy efficiently.
Journal ArticleDOI

The Combinatorial BLAS: design, implementation, and applications

TL;DR: The parallel Combinatorial BLAS is described, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications, and an extensible library interface and some guiding principles for future development are provided.
Journal ArticleDOI

Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition

TL;DR: An O(M) algorithm is produced to solve Ax = b, where M is the number of multiplications needed to factor A into LU, and the concept of an unordered merge plays a key role in obtaining this algorithm.