Open Access Proceedings ArticleDOI

Optimizing sparse tensor times matrix on multi-core and many-core architectures

TLDR
The optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms is presented; SpTTM is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition.
Abstract
This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition. We first design and implement a sequential SpTTM that avoids the explicit data transformation between a tensor and a matrix used in the conventional approach. We further optimize SpTTM on multicore CPU and GPU systems by parallelizing, avoiding locks, and exploiting data locality. Our sequential SpTTM is up to 3.5× faster than the SpTTM in Tensor Toolbox and 1.5× faster than that in the Cyclops Tensor Framework. Our parallel algorithms achieve 4.1× speedup on a multicore Intel Core i7 and 18.8× speedup on an NVIDIA K40c GPU, respectively, over our sequential SpTTM.
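To make the primitive concrete, below is a minimal sketch of a mode-n SpTTM on a COO-format sparse tensor, in the spirit of the transformation-free approach the abstract describes. It is an illustration, not the paper's implementation: the function name spttm, the COO layout, and the dict-of-fibers output representation are all assumptions made here for clarity.

```python
import numpy as np

def spttm(indices, values, U, mode):
    """Illustrative mode-`mode` SpTTM on a COO sparse tensor.

    Computes Y = X x_mode U without matricizing X:
      Y[i_0, ..., r, ..., i_{N-1}] = sum_{i_mode} X[i_0, ..., i_{N-1}] * U[r, i_mode]

    indices : (nnz, N) integer coordinates of the nonzeros
    values  : (nnz,)   nonzero values
    U       : (R, I_mode) dense matrix

    Returns a dict mapping each reduced coordinate (all modes except
    `mode`) to its dense length-R output fiber, i.e. a semi-sparse result.
    """
    R = U.shape[0]
    out = {}
    for coord, v in zip(indices, values):
        key = tuple(np.delete(coord, mode))  # fiber this nonzero contributes to
        if key not in out:
            out[key] = np.zeros(R)
        out[key] += v * U[:, coord[mode]]    # accumulate v * U[:, i_mode]
    return out

# Example: a 3x4x2 tensor with three nonzeros, multiplied along mode 1 (I_1 = 4).
idx = np.array([[0, 1, 0], [2, 3, 1], [0, 2, 0]])
val = np.array([1.0, 2.0, 3.0])
U = np.random.rand(5, 4)                 # R = 5
Y = spttm(idx, val, U, mode=1)           # output fibers keyed by (i_0, i_2)
```

The key point mirrored in the sketch is that each nonzero scatters directly into its output fiber, so no tensor-to-matrix conversion (and no conversion back) is ever materialized.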


Citations
Journal ArticleDOI

The tensor algebra compiler

TL;DR: TACO, as presented in this paper, is a C++ library that automatically generates code for compound tensor algebra operations on dense and sparse tensors, with applications in machine learning, data analytics, engineering, and the physical sciences.
Journal ArticleDOI

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

TL;DR: A thorough evaluation of five of the latest modern GPU interconnects, across six high-end servers and HPC platforms, shows that for an application running on a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance.
Proceedings ArticleDOI

HiCOO: hierarchical storage of sparse tensors

TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.
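Since MTTKRP recurs throughout this literature, a minimal COO-based sketch of the operation may help. This is an illustrative baseline only, not HiCOO's hierarchical blocked implementation; the function name mttkrp and the factor-matrix names B and C are assumptions.

```python
import numpy as np

def mttkrp(indices, values, B, C, I0):
    """Illustrative mode-0 MTTKRP for a 3-way COO tensor:
    M[i, :] += X[i, j, k] * (B[j, :] * C[k, :]),
    i.e. each nonzero accumulates an elementwise product of factor rows.
    """
    M = np.zeros((I0, B.shape[1]))       # I0 rows, rank-R columns
    for (i, j, k), v in zip(indices, values):
        M[i] += v * B[j] * C[k]
    return M
```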
Proceedings Article

Low-Rank Tucker Decomposition of Large Tensors Using TensorSketch

TL;DR: Two randomized, sketching-based algorithms for low-rank Tucker decomposition of tensors are presented; they require only a single pass over the input tensor and can handle tensors whose elements are streamed in any order.
References
Proceedings ArticleDOI

Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions

TL;DR: This work demonstrates the performance of coupled cluster (CC) with single and double excitations on 8192 nodes of Blue Gene/Q and shows that CTF outperforms NWChem on Cray XE6 supercomputers for the benchmarked systems.
Proceedings ArticleDOI

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data

TL;DR: This work presents a GPU SpGEMM algorithm that particularly focuses on load balancing, memory pre-allocation for the result matrix, and parallel insertion of the nonzero entries; its merge method is experimentally found to be the fastest GPU merge approach.
Posted Content

Tensor Decompositions: A New Concept in Brain Data Analysis?

TL;DR: New and emerging models and approaches for tensor decompositions are reviewed, with applications to group and linked multiway BSS/ICA, feature extraction, classification, and Multiway Partial Least Squares (MPLS).
Journal ArticleDOI

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

TL;DR: In this article, the authors present the first implementation of the 3D SpGEMM formulation that exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency.
Proceedings Article

PINTS: peer-to-peer infrastructure for tagging systems

TL;DR: This paper introduces a vector space model for characterizing users, resources, and tags, analyzes the problem of constructing reliable approximations of feature vectors in a fully decentralized setting, and proposes possible solutions.