Open Access · Proceedings Article

Optimizing sparse tensor times matrix on multi-core and many-core architectures

TL;DR: The paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms, a primitive that is a critical bottleneck in data analysis and mining applications based on tensor methods such as the Tucker decomposition.
Abstract
This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition. We first design and implement sequential SpTTM to avoid explicit data transformations between a tensor and a matrix, which is the conventional approach. We further optimize SpTTM on multicore CPU and GPU systems by parallelizing, avoiding locks, and exploiting data locality. Our sequential SpTTM is up to 3.5× faster than the SpTTM from the Tensor Toolbox and 1.5× faster than that from the Cyclops Tensor Framework. Our parallel algorithms achieve 4.1× speedup on a multicore Intel Core i7 and 18.8× speedup on an NVIDIA K40c GPU over our sequential SpTTM, respectively.
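
To make the primitive concrete, here is a minimal sequential sketch of a mode-3 SpTTM, Y(i,j,r) = Σ_k X(i,j,k)·U(k,r), on a third-order COO tensor. It works directly on the coordinates rather than unfolding the tensor into a matrix first; the type and function names (CooTensor, spttm_mode3) are ours for illustration, not taken from the paper's implementation.

```cpp
// Minimal sequential sketch of mode-3 SpTTM on a third-order COO tensor.
// Hypothetical names (CooTensor, spttm_mode3); not the paper's code.
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct CooTensor {                 // third-order sparse tensor in COO format
    std::vector<uint32_t> i, j, k; // coordinates of each nonzero
    std::vector<double>   val;     // nonzero values
};

// Computes Y(i,j,:) += X(i,j,k) * U(k,:) for every nonzero X(i,j,k),
// operating on the coordinates; no tensor-to-matrix unfolding. The result
// is semi-sparse: sparse in modes 1 and 2, dense fibers of length R in
// mode 3, so output fibers are keyed by the (i,j) pair.
std::map<std::pair<uint32_t, uint32_t>, std::vector<double>>
spttm_mode3(const CooTensor& X, const std::vector<std::vector<double>>& U) {
    const std::size_t R = U.empty() ? 0 : U[0].size();
    std::map<std::pair<uint32_t, uint32_t>, std::vector<double>> Y;
    for (std::size_t n = 0; n < X.val.size(); ++n) {
        std::vector<double>& fiber = Y[{X.i[n], X.j[n]}];
        if (fiber.empty()) fiber.assign(R, 0.0);
        for (std::size_t r = 0; r < R; ++r)
            fiber[r] += X.val[n] * U[X.k[n]][r]; // scale row k of U, accumulate
    }
    return Y;
}

int main() {
    // Nonzeros: X(0,0,0)=1, X(0,1,1)=2, X(1,2,1)=3; U is a 2x2 dense matrix.
    CooTensor X{{0, 0, 1}, {0, 1, 2}, {0, 1, 1}, {1.0, 2.0, 3.0}};
    std::vector<std::vector<double>> U = {{1, 2}, {3, 4}};
    for (const auto& [ij, fiber] : spttm_mode3(X, U))
        std::printf("Y(%u,%u,:) = [%g, %g]\n",
                    (unsigned)ij.first, (unsigned)ij.second, fiber[0], fiber[1]);
}
```

Each nonzero of X scales one row of U and accumulates into the dense mode-3 fiber Y(i,j,:); when the loop over nonzeros is parallelized, distinct threads may target the same fiber, which is why the abstract emphasizes lock avoidance and data locality.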


Citations
Journal Article

The tensor algebra compiler

TL;DR: TACO is a C++ library that automatically generates kernels for compound tensor algebra operations on dense and sparse tensors, which arise in machine learning, data analytics, engineering, and the physical sciences.
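
As a concrete instance of a compound operation that such a compiler can fuse into one sparse kernel, consider sampled dense-dense matrix multiplication (SDDMM), written in standard index notation (our illustration, not a claim about this paper's benchmarks):

    A_{ij} = B_{ij} \sum_k C_{ik} D_{kj}

where B is sparse and C, D are dense. A generated kernel visits only the nonzeros of B instead of materializing the dense product C·D and then masking it.
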
Journal Article

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

TL;DR: A thorough evaluation of five of the latest modern GPU interconnects across six high-end servers and HPC platforms shows that, for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance.
Proceedings Article

HiCOO: hierarchical storage of sparse tensors

TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP), the most expensive computational kernel in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.
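
For reference, the mode-1 MTTKRP of a third-order tensor X with factor matrices B and C is, in the standard notation assumed here,

    \tilde{A}_{ir} = \sum_{j,k} X_{ijk} B_{jr} C_{kr}

or, in matricized form, \tilde{A} = X_{(1)} (C \odot B), where \odot denotes the Khatri-Rao product. Every nonzero of X contributes one multiply-accumulate per column r, so the sparse tensor's storage layout, which HiCOO compresses into cache-friendly hierarchical blocks, largely determines memory traffic.
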
Proceedings Article

Low-Rank Tucker Decomposition of Large Tensors Using TensorSketch

TL;DR: Two randomized algorithms for low-rank Tucker decomposition of tensors are proposed that incorporate sketching, require only a single pass over the input tensor, and can handle tensors whose elements are streamed in any order.
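
For context, a Tucker decomposition approximates a third-order tensor as a small core tensor multiplied by a factor matrix along each mode, in the standard notation assumed here:

    \mathcal{X} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)}

where \times_n is the tensor-times-matrix product whose sparse form (SpTTM) the paper above optimizes. Sketching replaces the exact factor-matrix computations with randomized projections, which is what allows a single streaming pass over the input tensor.
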
References
Posted Content

Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions

TL;DR: This paper addresses the problem of efficiently computing contractions between two tensors of arbitrary dimension using kernels from the highly optimized BLAS library, and establishes precise conditions for determining if and when GEMM, the BLAS kernel for matrix-matrix products, can be used.
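
A minimal illustration of the kind of condition involved (our example, not one taken from the paper): assuming column-major storage with the first index fastest, the contraction

    C_{abc} = \sum_k A_{abk} B_{kc}

is a single GEMM, since grouping the free indices (a, b) into one row index gives C_{(ab),c} = \sum_k A_{(ab),k} B_{kc} with all index groups contiguous in memory. By contrast, C_{abc} = \sum_k A_{akb} B_{kc} places the contracted index k between free indices of A, so it requires either a loop of smaller GEMMs over b or an explicit transposition of A.
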
Proceedings Article

A communication-optimal framework for contracting distributed tensors

TL;DR: A framework with three fundamental communication operators is developed to generate communication-efficient algorithms for arbitrary tensor contractions, and it is shown that, for a given amount of memory per processor, the framework is communication optimal for all tensor contractions.
Journal Article

Sparse Matrix-Matrix Products Executed Through Coloring

TL;DR: This paper proposes a new algorithm for computing sparse matrix-matrix products that exploits their nonzero structure through graph coloring, and demonstrates its viability on examples including multigrid methods used to solve boundary value problems as well as matrix products appearing in unstructured applications.
Posted Content

Scalable Latent Tree Model and its Application to Health Analytics

TL;DR: An integrated approach to structure and parameter estimation in latent tree graphical models, where some nodes are hidden, which is guaranteed to correctly recover the unknown tree structure and the model parameters with low sample complexity for the class of linear multivariate latent tree models, a class that includes discrete and Gaussian distributions as well as Gaussian mixtures.

Communication Lower Bounds for Tensor Contraction Algorithms

TL;DR: It is proved that any schedule of the symmetry-preserving algorithm requires asymptotically more vertical and horizontal communication than the direct evaluation algorithm for some fully symmetric contractions, and that for the instances of fully symmetric contractions that arise in quantum chemistry calculations, the lower bounds are asymptotically the same for both algorithms.