Open Access Proceedings ArticleDOI

Optimizing sparse tensor times matrix on multi-core and many-core architectures

TLDR
The optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms is presented; SpTTM is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition.
Abstract
This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition. We first design and implement a sequential SpTTM that avoids the explicit data transformation between a tensor and a matrix used in the conventional approach. We further optimize SpTTM on multicore CPU and GPU systems by parallelizing, avoiding locks, and exploiting data locality. Our sequential SpTTM is up to 3.5× faster than the SpTTM in Tensor Toolbox and 1.5× faster than that in the Cyclops Tensor Framework. Our parallel algorithms achieve 4.1× speedup on a multicore Intel Core i7 and 18.8× speedup on an NVIDIA K40c GPU, respectively, over our sequential SpTTM.
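To make the primitive concrete, below is a minimal sketch of a mode-n SpTTM on a COO-format sparse tensor, in the spirit of the transformation-free approach the abstract describes. It is an illustration, not the paper's implementation: the function name spttm, the COO layout, and the dict-of-fibers output representation are all assumptions made here for clarity.

```python
import numpy as np

def spttm(indices, values, U, mode):
    """Illustrative mode-`mode` SpTTM on a COO sparse tensor.

    Computes Y = X x_mode U without matricizing X:
      Y[i_0, ..., r, ..., i_{N-1}] = sum_{i_mode} X[i_0, ..., i_{N-1}] * U[r, i_mode]

    indices : (nnz, N) integer coordinates of the nonzeros
    values  : (nnz,)   nonzero values
    U       : (R, I_mode) dense matrix

    Returns a dict mapping each reduced coordinate (all modes except
    `mode`) to its dense length-R output fiber, i.e. a semi-sparse result.
    """
    R = U.shape[0]
    out = {}
    for coord, v in zip(indices, values):
        key = tuple(np.delete(coord, mode))  # fiber this nonzero contributes to
        if key not in out:
            out[key] = np.zeros(R)
        out[key] += v * U[:, coord[mode]]    # accumulate v * U[:, i_mode]
    return out

# Example: a 3x4x2 tensor with three nonzeros, multiplied along mode 1 (I_1 = 4).
idx = np.array([[0, 1, 0], [2, 3, 1], [0, 2, 0]])
val = np.array([1.0, 2.0, 3.0])
U = np.random.rand(5, 4)                 # R = 5
Y = spttm(idx, val, U, mode=1)           # output fibers keyed by (i_0, i_2)
```

The key point mirrored in the sketch is that each nonzero scatters directly into its output fiber, so no tensor-to-matrix conversion (and no conversion back) is ever materialized.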


Citations
Journal ArticleDOI

The tensor algebra compiler

TL;DR: TACO, as presented in this paper, is a C++ library that automatically generates code for compound tensor algebra operations on dense and sparse tensors, with applications in machine learning, data analytics, engineering, and the physical sciences.
Journal ArticleDOI

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

TL;DR: A thorough evaluation of five of the latest modern GPU interconnects, across six high-end servers and HPC platforms, shows that for an application running on a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance.
Proceedings ArticleDOI

HiCOO: hierarchical storage of sparse tensors

TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.
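Since MTTKRP recurs throughout this literature, a minimal COO-based sketch of the operation may help. This is an illustrative baseline only, not HiCOO's hierarchical blocked implementation; the function name mttkrp and the factor-matrix names B and C are assumptions.

```python
import numpy as np

def mttkrp(indices, values, B, C, I0):
    """Illustrative mode-0 MTTKRP for a 3-way COO tensor:
    M[i, :] += X[i, j, k] * (B[j, :] * C[k, :]),
    i.e. each nonzero accumulates an elementwise product of factor rows.
    """
    M = np.zeros((I0, B.shape[1]))       # I0 rows, rank-R columns
    for (i, j, k), v in zip(indices, values):
        M[i] += v * B[j] * C[k]
    return M
```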
Proceedings Article

Low-Rank Tucker Decomposition of Large Tensors Using TensorSketch

TL;DR: Two randomized, sketching-based algorithms for low-rank Tucker decomposition of tensors are presented; they require only a single pass over the input tensor and can handle tensors whose elements are streamed in any order.
References
Proceedings ArticleDOI

Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions

TL;DR: This work demonstrates the performance of coupled cluster (CC) with single and double excitations on 8192 nodes of Blue Gene/Q and shows that CTF outperforms NWChem on Cray XE6 supercomputers for the benchmarked systems.
Proceedings ArticleDOI

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data

TL;DR: This work presents a GPU SpGEMM algorithm that particularly focuses on load balancing, memory pre-allocation for the result matrix, and parallel insertion of the nonzero entries; its merge method is experimentally found to be the fastest GPU merge approach.
Posted Content

Tensor Decompositions: A New Concept in Brain Data Analysis?

TL;DR: New and emerging models and approaches for tensor decompositions are reviewed, with applications to group and linked multiway BSS/ICA, feature extraction, classification, and Multiway Partial Least Squares (MPLS).
Journal ArticleDOI

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

TL;DR: In this article, the authors present the first implementation of the 3D SpGEMM formulation that exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency.
Proceedings Article

PINTS: peer-to-peer infrastructure for tagging systems

TL;DR: This paper introduces a vector space model for characterizing users, resources, and tags, analyzes the problem of constructing reliable approximations of feature vectors in a fully decentralized setting, and proposes possible solutions.