Proceedings ArticleDOI

Optimizing sparse tensor times matrix on multi-core and many-core architectures

TL;DR: Presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms; SpTTM is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition.
Abstract: This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition. We first design and implement sequential SpTTM to avoid explicit data transformations between a tensor and a matrix, which is the conventional approach. We further optimize SpTTM on multicore CPU and GPU systems by parallelizing, avoiding locks, and exploiting data locality. Our sequential SpTTM is up to 3.5× faster than the SpTTM from Tensor Toolbox and 1.5× faster than that from the Cyclops Tensor Framework. Our parallel algorithms show 4.1× speedup on a multicore Intel Core i7 and 18.8× speedup on an NVIDIA K40c GPU, respectively, over our sequential SpTTM.
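For readers skimming the listing, a minimal sketch of the mode-n operation Y = X ×ₙ U may help make the abstract concrete. This is an illustrative COO-based loop, not the paper's optimized kernel; the function name, the dict-of-fibers output, and the (Iₙ × R) transposed storage of U (mentioned in the reference snippets below) are all assumptions.

```python
# A minimal sketch of mode-n sparse tensor-times-dense matrix multiply
# (SpTTM), Y = X x_n U, with X stored in COO form. Illustrative only.
import numpy as np
from collections import defaultdict

def spttm(inds, vals, U, mode):
    """inds: (nnz, N) NumPy integer array of COO indices of sparse X.
    vals: (nnz,) nonzero values. U: dense (I_mode, R) matrix, stored
    transposed so each update reads one contiguous row.
    Returns a dict mapping the remaining indices to dense length-R
    fibers along `mode` (the result is only semi-sparse)."""
    R = U.shape[1]
    fibers = defaultdict(lambda: np.zeros(R))
    for idx, x in zip(inds, vals):
        key = tuple(np.delete(idx, mode))   # indices in all modes but `mode`
        fibers[key] += x * U[idx[mode], :]  # accumulate a scaled matrix row
    return fibers
```

Each surviving fiber along the product mode becomes a dense length-R vector, which is why operating on the tensor directly, rather than matricizing it first, pays off.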
Citations
Journal ArticleDOI
12 Oct 2017
TL;DR: taco, as introduced in this paper, is a C++ library that automatically generates kernels for compound tensor algebra operations on dense and sparse tensors, with applications in machine learning, data analytics, engineering, and the physical sciences.
Abstract: Tensor algebra is a powerful tool with applications in machine learning, data analytics, engineering and the physical sciences. Tensors are often sparse and compound operations must frequently be computed in a single kernel for performance and to save memory. Programmers are left to write kernels for every operation of interest, with different mixes of dense and sparse tensors in different formats. The combinations are infinite, which makes it impossible to manually implement and optimize them all. This paper introduces the first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors. The technique is implemented in a C++ library called taco. Its performance is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.
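To make "compound operations in a single kernel" concrete: the taco paper's running examples include expressions such as A(i,j) = B(i,j,k) * c(k). Below is a hand-written Python sketch of the kind of fused loop taco generates automatically; the function name and COO layout are illustrative assumptions, not taco's actual output or API.

```python
# Illustrative fused kernel: A(i,j) = sum_k B(i,j,k) * c(k), computed in
# a single pass over the nonzeros of a sparse 3-way COO tensor B, with
# no dense intermediate ever materialized.
import numpy as np
from collections import defaultdict

def fused_ttv(B_inds, B_vals, c):
    """B_inds: (nnz, 3) NumPy array of COO indices of B.
    B_vals: (nnz,) nonzero values. c: dense vector.
    Returns the sparse result A as a dict {(i, j): value}."""
    A = defaultdict(float)
    for (i, j, k), b in zip(B_inds, B_vals):
        A[(i, j)] += b * c[k]   # fuse the multiply with the reduction over k
    return A
```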

240 citations

Journal ArticleDOI
TL;DR: A thorough evaluation of five of the latest modern GPU interconnects, across six high-end servers and HPC platforms, shows that for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance.
Abstract: High-performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computational capability in emerging domains such as deep learning, big data, and planet-scale simulations. However, the lack of a deep understanding of how modern GPUs can be connected, and of the real impact of state-of-the-art interconnect technology on multi-GPU application performance, remains a hurdle. In this paper, we fill the gap by conducting a thorough evaluation of five of the latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI, and NVSwitch, on six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity, and routing, while one is caused by a PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling, and migration in a shared environment (e.g., AI clouds and HPC centers), as well as for communication-oriented performance tuning.

118 citations


Cites background from "Optimizing sparse tensor times matr..."

  • ...ence on GPU analytic modeling [38], [39], [40] and performance optimization [41], [42], [43], [44], [45], [46], [47], [48],...


Proceedings ArticleDOI
11 Nov 2018
TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.
Abstract: This paper proposes a new storage format for sparse tensors, called Hierarchical COOrdinate (HiCOO; pronounced: "haiku"). It derives from the coordinate (COO) format, arguably the de facto standard for general sparse tensor storage. HiCOO improves upon COO by compressing the indices in units of sparse tensor blocks, with the goals of preserving the "mode-agnostic" simplicity of COO while reducing the bytes needed to represent the tensor and promoting data locality. We evaluate HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm. This MTTKRP implementation achieves up to 23.0× (6.8× on average) speedup over the COO format and up to 15.6× (3.1× on average) speedup over another state-of-the-art format, compressed sparse fiber (CSF), while using less than or comparable storage to them. When used within CPD, we also observe speedups against COO- and CSF-based implementations.
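A rough sketch of the HiCOO idea described above: group COO nonzeros into small cubical blocks, then split each index into a shared block coordinate plus a narrow in-block offset so most index bytes are stored once per block. The names, block size, and dict-of-blocks layout below are illustrative assumptions, not the paper's actual data structure.

```python
# Sketch of HiCOO-style index compression: wide block coordinates shared
# per block, 8-bit offsets per nonzero within the block.
import numpy as np
from collections import defaultdict

def to_hicoo(inds, vals, B=128):
    """inds: (nnz, N) NumPy array of COO indices; vals: (nnz,) values;
    B: block edge length (a power of two keeps the split cheap and lets
    offsets fit in uint8). Returns {block_coords: (offsets, values)}."""
    blocks = defaultdict(lambda: ([], []))
    for idx, v in zip(inds, vals):
        bidx = tuple(idx // B)                   # block coordinates (wide ints)
        offs, vs = blocks[bidx]
        offs.append((idx % B).astype(np.uint8))  # narrow in-block offsets
        vs.append(v)
    return {b: (np.array(o), np.array(v)) for b, (o, v) in blocks.items()}
```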

83 citations


Cites background from "Optimizing sparse tensor times matr..."

  • ...Some tensor formats are proposed for structural sparse tensors or specific tensor operations, such as “mode-generic and mode-specific” [20] and semi-COOrdinate (sCOO) [21, 46] formats for the ones with dense modes, Extended Karnaugh Map Representation (EKMR) for some other tensor operations....



Proceedings Article
01 Jan 2018
TL;DR: Two randomized algorithms for low-rank Tucker decomposition of tensors, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order.
Abstract: We propose two randomized algorithms for low-rank Tucker decomposition of tensors. The algorithms, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order. To the best of our knowledge, ours are the only algorithms which can do this. We test our algorithms on sparse synthetic data and compare them to multiple other methods. We also apply one of our algorithms to a real dense 38 GB tensor representing a video and use the resulting decomposition to correctly classify frames containing disturbances.
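As a simplified illustration of single-pass sketching (not the paper's actual TensorSketch-based algorithms): maintain a small sketch Yₙ = X₍ₙ₎ Ωₙ of every mode-n unfolding, regenerating the needed random row of Ωₙ on the fly from a hash of the column index, so that elements can stream in any order and each is touched exactly once. Everything below is an illustrative assumption.

```python
# Single-pass, order-independent sketching of all mode-n unfoldings of a
# streamed sparse tensor. The random row of Omega_n for a given column is
# rederived deterministically from a hash, so no Omega_n is ever stored.
import numpy as np

def streaming_mode_sketches(stream, dims, k, seed=0):
    """stream: iterable of (index_tuple, value) pairs; dims: tensor shape;
    k: sketch size. Returns a list of (dims[n], k) sketch matrices."""
    N = len(dims)
    Y = [np.zeros((dims[n], k)) for n in range(N)]
    for idx, x in stream:
        for n in range(N):
            rest = idx[:n] + idx[n+1:]          # column index in X_(n)
            rng = np.random.default_rng(hash((seed, n, rest)) % 2**32)
            Y[n][idx[n], :] += x * rng.standard_normal(k)  # row of Omega_n
    return Y
```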

68 citations


Cites background from "Optimizing sparse tensor times matr..."

  • ...[24] Jiajia Li, Yuchen Ma, Chenggang Yan, and Richard Vuduc....


  • ...Other papers that use memory efficient and distributed methods include [4, 23, 24, 25, 32, 18, 1, 19, 28]....


References
Journal ArticleDOI
TL;DR: This survey provides an overview of higher-order tensor decompositions, their applications, and available software.
Abstract: This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or N-way array. Decompositions of higher-order tensors (i.e., N-way arrays with N ≥ 3) have applications in psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and elsewhere. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decomposition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal component analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2, as well as nonnegative variants of all of the above. The N-way Toolbox, Tensor Toolbox, and Multilinear Engine are examples of software packages for working with tensors.
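A tiny worked example of the two decompositions the survey singles out, reconstructing a 3-way tensor from its factors with NumPy (the shapes and variable names are illustrative):

```python
# CP reconstructs X as a sum of R rank-one tensors; Tucker multiplies a
# small core G by a factor matrix along every mode.
import numpy as np

I, J, K, R = 4, 5, 6, 3
A, B, C = (np.random.rand(d, R) for d in (I, J, K))
G = np.random.rand(R, R, R)

# CP/PARAFAC: X[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r]
X_cp = np.einsum('ir,jr,kr->ijk', A, B, C)

# Tucker: X[i,j,k] = sum_{p,q,r} G[p,q,r] * A[i,p] * B[j,q] * C[k,r]
X_tucker = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)
```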

9,227 citations


"Optimizing sparse tensor times matr..." refers background or methods in this paper

  • ...In this section, we test our algorithms on two platforms and also compare the performance with state-of-the-art Tensor Toolbox library [13]....


  • ...Several examples and definitions are drawn from the overview by Kolda and Bader [13]....


  • ...Since our SpTTM algorithm is a main kernel for low-rank decomposition, the rank R is usually a number smaller than 100 [13]....


  • ...Different from [13], we use the transposed form of the matrix U for efficient TTM in the row-major storage pattern of the C language....


  • ...) The speed of some of the most popular tensor decompositions, including the so-called Tucker decomposition [13], depends critically on having a fast SpTTM, thereby motivating this study....


Proceedings Article
11 Jul 2010
TL;DR: This work proposes an approach and a set of design principles for an intelligent computer agent that runs forever and describes a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs.
Abstract: We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.

2,010 citations


"Optimizing sparse tensor times matr..." refers background or methods in this paper

  • ...We use sparse tensors from real applications including functional Magnetic Resonance Imaging (fMRI) measurements of brain activity [23] (“brainq” with noun-voxel-human), Never Ending Language Learning (NELL) project [22] (“nell1” and “nell2” with noun-verb-noun), and data crawled from tagging systems [24] (“deli” with user-item-tag)....


  • ...For example the density of ‘nell2’ tensor from Never Ending Language Learning (NELL) project [22] is 2....


Journal ArticleDOI
30 May 2008-Science
TL;DR: A computational model is presented that predicts the functional magnetic resonance imaging (fMRI) neural activation associated with words for which fMRI data are not yet available, trained with a combination of data from a trillion-word text corpus and observed fMRI data associated with viewing several dozen concrete nouns.
Abstract: The question of how the human brain represents conceptual knowledge has been debated in many scientific fields. Brain imaging studies have shown that different spatial patterns of neural activation are associated with thinking about different semantic categories of pictures and words (for example, tools, buildings, and animals). We present a computational model that predicts the functional magnetic resonance imaging (fMRI) neural activation associated with words for which fMRI data are not yet available. This model is trained with a combination of data from a trillion-word text corpus and observed fMRI data associated with viewing several dozen concrete nouns. Once trained, the model predicts fMRI activation for thousands of other concrete nouns in the text corpus, with highly significant accuracies over the 60 nouns for which we currently have fMRI data.

1,204 citations


"Optimizing sparse tensor times matr..." refers methods in this paper

  • ...We use sparse tensors from real applications including functional Magnetic Resonance Imaging (fMRI) measurements of brain activity [23] (“brainq” with noun-voxel-human), Never Ending Language Learning (NELL) project [22] (“nell1” and “nell2” with noun-verb-noun), and data crawled from tagging systems [24] (“deli” with user-item-tag)....


Posted Content
TL;DR: A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices; this implies a robust and computationally tractable estimation approach for several popular latent variable models.
Abstract: This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
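The core step the paper analyzes is the tensor power iteration v ← T(I, v, v) / ‖T(I, v, v)‖. A bare-bones NumPy version is sketched below; deflation, random restarts, and the robustness machinery are omitted, and the function name and interface are illustrative rather than the paper's exact procedure.

```python
# One run of the plain tensor power iteration on a symmetric 3-way
# tensor T, converging to one component of an orthogonal decomposition.
import numpy as np

def tensor_power_iteration(T, iters=100, seed=0):
    """T: symmetric (d, d, d) array. Returns (eigenvector, eigenvalue)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = np.einsum('ijk,j,k->i', T, v, v)   # w = T(I, v, v)
        v = w / np.linalg.norm(w)              # renormalize each step
    lam = np.einsum('ijk,i,j,k->', T, v, v, v) # lambda = T(v, v, v)
    return v, lam
```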

842 citations


"Optimizing sparse tensor times matr..." refers background in this paper

  • ...Such applications arise in numerous domains, including neuroscience [1, 2], healthcare analytics [3–5], natural language processing [6], signal processing [7], machine learning [8, 9], and social network analytics [10]....

