Optimizing sparse tensor times matrix on multi-core and many-core architectures
Jiajia Li, Yuchen Ma, Chenggang Yan, Richard Vuduc, et al.
pp. 26–33
Abstract: This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition. We first design and implement a sequential SpTTM that avoids explicit data transformations between a tensor and a matrix, which is the conventional approach. We further optimize SpTTM on multicore CPU and GPU systems by parallelizing, avoiding locks, and exploiting data locality. Our sequential SpTTM is up to 3.5× faster than the SpTTM in Tensor Toolbox and up to 1.5× faster than that in the Cyclops Tensor Framework. Our parallel algorithms achieve 4.1× speedup on a multicore Intel Core i7 and 18.8× speedup on an NVIDIA K40c GPU, respectively, over our sequential SpTTM.
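To make the operation concrete, here is a minimal sketch of mode-n SpTTM on a COO (coordinate-format) sparse tensor. This is not the paper's optimized CPU/GPU kernel; the function name `spttm` and the dict-of-fibers output layout are illustrative choices. It computes Y = X ×ₙ U directly from the nonzeros, without matricizing the tensor, producing the semi-sparse result (dense fibers along the product mode):

```python
import numpy as np
from collections import defaultdict

def spttm(indices, values, U, mode):
    """Mode-`mode` sparse tensor-times-dense matrix product.

    indices : (nnz, N) int array of COO coordinates
    values  : (nnz,) nonzero values
    U       : (I_mode, R) dense matrix
    Returns a dict mapping the coordinates of all modes except `mode`
    to a dense length-R fiber: Y[..., :, ...] = sum_k X[..., k, ...] * U[k, :].
    """
    R = U.shape[1]
    fibers = defaultdict(lambda: np.zeros(R))
    for idx, v in zip(indices, values):
        key = tuple(np.delete(idx, mode))    # coordinates held fixed
        fibers[key] += v * U[idx[mode], :]   # accumulate one dense fiber
    return dict(fibers)
```

Each nonzero contributes one scaled row of U to the fiber at its remaining coordinates, which is why parallel versions must synchronize (or privatize) updates to shared fibers, the locking issue the paper addresses.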
Citations
Journal Article (DOI)
The tensor algebra compiler
TL;DR: TACO is a C++ library that automatically generates kernels for compound tensor algebra operations on dense and sparse tensors, with applications in machine learning, data analytics, engineering, and the physical sciences.
Journal Article (DOI)
Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect
TL;DR: A thorough evaluation of five of the latest modern GPU interconnects, measured on six high-end servers and HPC platforms, shows that, for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance.
Proceedings Article (DOI)
HiCOO: hierarchical storage of sparse tensors
TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.
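For context, a minimal sketch of the MTTKRP operation that HiCOO is benchmarked on, written against a plain COO tensor rather than HiCOO's hierarchical block format; the function name `mttkrp_coo` is illustrative. For a 3-mode tensor, mode-1 MTTKRP computes M = X₍₁₎(C ⊙ B):

```python
import numpy as np

def mttkrp_coo(indices, values, B, C, I):
    """Mode-1 MTTKRP for a 3-mode COO sparse tensor:
    M[i, r] = sum over nonzeros x(i, j, k) * B[j, r] * C[k, r].

    indices : (nnz, 3) int array of COO coordinates (i, j, k)
    values  : (nnz,) nonzero values
    B, C    : dense factor matrices of shapes (J, R) and (K, R)
    I       : extent of mode 1 (number of output rows)
    """
    R = B.shape[1]
    M = np.zeros((I, R))
    for (i, j, k), v in zip(indices, values):
        M[i, :] += v * B[j, :] * C[k, :]  # elementwise (Hadamard) row product
    return M
```

The scattered row updates to M are the locality and synchronization challenge that blocked formats like HiCOO target.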
Proceedings Article
Low-Rank Tucker Decomposition of Large Tensors Using TensorSketch
Osman Asif Malik, Stephen Becker, et al.
TL;DR: Two randomized algorithms for low-rank Tucker decomposition of tensors, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order.
References
Posted Content
Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions
Edoardo Di Napoli, Diego Fabregat-Traver, Gregorio Quintana-Ortí, Paolo Bientinesi
TL;DR: This paper addresses the problem of efficiently computing contractions between two tensors of arbitrary dimension by using kernels from the highly optimized BLAS library, and establishes precise conditions to determine if and when GEMM, the kernel for matrix products, can be used.
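The core idea of mapping a contraction onto GEMM can be sketched in a few lines: when the contracted index is adjacent to the free indices in memory, flattening the free modes turns the contraction into a single matrix product. This is a toy illustration under that assumption, not the paper's general condition analysis:

```python
import numpy as np

# Contract a 3-mode tensor A[i, j, k] with a matrix W[k, l] over k.
# Because k is the trailing (contiguous) mode here, reshaping the free
# modes (i, j) into one dimension exposes the contraction as one GEMM.
I, J, K, L = 3, 4, 5, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((I, J, K))
W = rng.standard_normal((K, L))

C_gemm = (A.reshape(I * J, K) @ W).reshape(I, J, L)  # single GEMM call
C_ref = np.einsum('ijk,kl->ijl', A, W)               # direct evaluation
assert np.allclose(C_gemm, C_ref)
```

When the index layout does not permit this, the paper's framework decides whether transpositions or lower-level BLAS kernels are needed instead.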
Proceedings Article (DOI)
A communication-optimal framework for contracting distributed tensors
TL;DR: A framework with three fundamental communication operators is developed to generate communication-efficient contraction algorithms for arbitrary tensor contractions, and it is shown that, for a given amount of memory per processor, the framework is communication optimal for all tensor contractions.
Journal Article (DOI)
Sparse Matrix-Matrix Products Executed Through Coloring
TL;DR: This paper proposes a new algorithm for computing sparse matrix-matrix products by exploiting their nonzero structure through the process of graph coloring and proves its viability for examples including multigrid methods used to solve boundary value problems as well as matrix products appearing in unstructured applications.
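For reference, this is the product that the coloring scheme accelerates, sketched naively on a dict-of-dicts sparse representation; it does not implement the graph-coloring algorithm itself, and the representation is an illustrative choice:

```python
from collections import defaultdict

def spgemm(A, B):
    """Naive sparse matrix-matrix product, C = A @ B.

    A, B : sparse matrices as {row: {col: value}} dicts.
    Returns C in the same format, accumulating partial products
    row by row (Gustavson-style iteration order).
    """
    C = {}
    for i, row in A.items():
        acc = defaultdict(float)
        for k, a in row.items():        # nonzeros of row i of A
            for j, b in B.get(k, {}).items():  # nonzeros of row k of B
                acc[j] += a * b
        C[i] = dict(acc)
    return C
```

The irregular accumulation pattern in `acc` is exactly what makes SpGEMM hard to vectorize; the paper's coloring approach restructures it by grouping structurally independent work.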
Posted Content
Scalable Latent Tree Model and its Application to Health Analytics
TL;DR: An integrated approach to structure and parameter estimation in latent tree graphical models, where some nodes are hidden, is proposed; it is guaranteed to correctly recover the unknown tree structure and the model parameters with low sample complexity for the class of linear multivariate latent tree models, which includes discrete and Gaussian distributions as well as Gaussian mixtures.
Communication Lower Bounds for Tensor Contraction Algorithms
TL;DR: It is proved that any schedule of the symmetry-preserving algorithm requires asymptotically more vertical and horizontal communication than the direct evaluation algorithm for some fully symmetric contractions, and that for the instances of fully symmetric contractions that arise in quantum chemistry calculations, the lower bounds are asymptotically the same for both algorithms.