Optimizing sparse tensor times matrix on multi-core and many-core architectures
Jiajia Li, Yuchen Ma, Chenggang Yan, Richard Vuduc, et al.
pp. 26–33
Abstract: This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition. We first design and implement a sequential SpTTM that avoids explicit data transformations between a tensor and a matrix, which is the conventional approach. We further optimize SpTTM on multicore CPU and GPU systems by parallelizing, avoiding locks, and exploiting data locality. Our sequential SpTTM is up to 3.5× faster than the SpTTM in Tensor Toolbox and up to 1.5× faster than that in the Cyclops Tensor Framework. Our parallel algorithms achieve 4.1× speedup on a multicore Intel Core i7 and 18.8× speedup on an NVIDIA K40c GPU, respectively, over our sequential SpTTM.
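To make the operation concrete, here is a minimal sketch of mode-n SpTTM on a COO (coordinate-format) sparse tensor. This is not the paper's optimized CPU/GPU kernel; the function name `spttm` and the dict-of-fibers output layout are illustrative choices. It computes Y = X ×ₙ U directly from the nonzeros, without matricizing the tensor, producing the semi-sparse result (dense fibers along the product mode):

```python
import numpy as np
from collections import defaultdict

def spttm(indices, values, U, mode):
    """Mode-`mode` sparse tensor-times-dense matrix product.

    indices : (nnz, N) int array of COO coordinates
    values  : (nnz,) nonzero values
    U       : (I_mode, R) dense matrix
    Returns a dict mapping the coordinates of all modes except `mode`
    to a dense length-R fiber: Y[..., :, ...] = sum_k X[..., k, ...] * U[k, :].
    """
    R = U.shape[1]
    fibers = defaultdict(lambda: np.zeros(R))
    for idx, v in zip(indices, values):
        key = tuple(np.delete(idx, mode))    # coordinates held fixed
        fibers[key] += v * U[idx[mode], :]   # accumulate one dense fiber
    return dict(fibers)
```

Each nonzero contributes one scaled row of U to the fiber at its remaining coordinates, which is why parallel versions must synchronize (or privatize) updates to shared fibers, the locking issue the paper addresses.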
Citations
Journal Article (DOI)
The tensor algebra compiler
TL;DR: TACO is a C++ library that automatically generates kernels for compound tensor algebra operations on dense and sparse tensors, with applications in machine learning, data analytics, engineering, and the physical sciences.
Journal Article (DOI)
Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect
TL;DR: A thorough evaluation of five of the latest modern GPU interconnects, measured on six high-end servers and HPC platforms, shows that, for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance.
Proceedings Article (DOI)
HiCOO: hierarchical storage of sparse tensors
TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.
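For context, a minimal sketch of the MTTKRP operation that HiCOO is benchmarked on, written against a plain COO tensor rather than HiCOO's hierarchical block format; the function name `mttkrp_coo` is illustrative. For a 3-mode tensor, mode-1 MTTKRP computes M = X₍₁₎(C ⊙ B):

```python
import numpy as np

def mttkrp_coo(indices, values, B, C, I):
    """Mode-1 MTTKRP for a 3-mode COO sparse tensor:
    M[i, r] = sum over nonzeros x(i, j, k) * B[j, r] * C[k, r].

    indices : (nnz, 3) int array of COO coordinates (i, j, k)
    values  : (nnz,) nonzero values
    B, C    : dense factor matrices of shapes (J, R) and (K, R)
    I       : extent of mode 1 (number of output rows)
    """
    R = B.shape[1]
    M = np.zeros((I, R))
    for (i, j, k), v in zip(indices, values):
        M[i, :] += v * B[j, :] * C[k, :]  # elementwise (Hadamard) row product
    return M
```

The scattered row updates to M are the locality and synchronization challenge that blocked formats like HiCOO target.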
Proceedings Article
Low-Rank Tucker Decomposition of Large Tensors Using TensorSketch
Osman Asif Malik, Stephen Becker, et al.
TL;DR: Two randomized algorithms for low-rank Tucker decomposition of tensors, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order.
References
Posted Content
Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions
Edoardo Di Napoli, Diego Fabregat-Traver, Gregorio Quintana-Ortí, Paolo Bientinesi
TL;DR: This paper addresses the problem of efficiently computing contractions between two tensors of arbitrary dimension by using kernels from the highly optimized BLAS library, and establishes precise conditions to determine if and when GEMM, the kernel for matrix products, can be used.
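The core idea of mapping a contraction onto GEMM can be sketched in a few lines: when the contracted index is adjacent to the free indices in memory, flattening the free modes turns the contraction into a single matrix product. This is a toy illustration under that assumption, not the paper's general condition analysis:

```python
import numpy as np

# Contract a 3-mode tensor A[i, j, k] with a matrix W[k, l] over k.
# Because k is the trailing (contiguous) mode here, reshaping the free
# modes (i, j) into one dimension exposes the contraction as one GEMM.
I, J, K, L = 3, 4, 5, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((I, J, K))
W = rng.standard_normal((K, L))

C_gemm = (A.reshape(I * J, K) @ W).reshape(I, J, L)  # single GEMM call
C_ref = np.einsum('ijk,kl->ijl', A, W)               # direct evaluation
assert np.allclose(C_gemm, C_ref)
```

When the index layout does not permit this, the paper's framework decides whether transpositions or lower-level BLAS kernels are needed instead.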
Proceedings Article (DOI)
A communication-optimal framework for contracting distributed tensors
TL;DR: A framework with three fundamental communication operators is developed to generate communication-efficient contraction algorithms for arbitrary tensor contractions, and it is shown that, for a given amount of memory per processor, the framework is communication optimal for all tensor contractions.
Journal Article (DOI)
Sparse Matrix-Matrix Products Executed Through Coloring
TL;DR: This paper proposes a new algorithm for computing sparse matrix-matrix products by exploiting their nonzero structure through the process of graph coloring and proves its viability for examples including multigrid methods used to solve boundary value problems as well as matrix products appearing in unstructured applications.
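For reference, this is the product that the coloring scheme accelerates, sketched naively on a dict-of-dicts sparse representation; it does not implement the graph-coloring algorithm itself, and the representation is an illustrative choice:

```python
from collections import defaultdict

def spgemm(A, B):
    """Naive sparse matrix-matrix product, C = A @ B.

    A, B : sparse matrices as {row: {col: value}} dicts.
    Returns C in the same format, accumulating partial products
    row by row (Gustavson-style iteration order).
    """
    C = {}
    for i, row in A.items():
        acc = defaultdict(float)
        for k, a in row.items():        # nonzeros of row i of A
            for j, b in B.get(k, {}).items():  # nonzeros of row k of B
                acc[j] += a * b
        C[i] = dict(acc)
    return C
```

The irregular accumulation pattern in `acc` is exactly what makes SpGEMM hard to vectorize; the paper's coloring approach restructures it by grouping structurally independent work.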
Posted Content
Scalable Latent Tree Model and its Application to Health Analytics
TL;DR: An integrated approach to structure and parameter estimation in latent tree graphical models, where some nodes are hidden, is proposed; it is guaranteed to correctly recover the unknown tree structure and the model parameters with low sample complexity for the class of linear multivariate latent tree models, which includes discrete and Gaussian distributions as well as Gaussian mixtures.
Communication Lower Bounds for Tensor Contraction Algorithms
TL;DR: It is proved that any schedule of the symmetry-preserving algorithm requires asymptotically more vertical and horizontal communication than the direct evaluation algorithm for some fully symmetric contractions, and that for the instances of fully symmetric contractions that arise in quantum chemistry calculations, the lower bounds are asymptotically the same for both algorithms.