Proceedings ArticleDOI

Optimizing sparse tensor times matrix on multi-core and many-core architectures

TL;DR: Presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms; SpTTM is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition.
Abstract: This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition. We first design and implement sequential SpTTM to avoid explicit data transformations between a tensor and a matrix, which is the conventional approach. We further optimize SpTTM on multicore CPU and GPU systems by parallelizing, avoiding locks, and exploiting data locality. Our sequential SpTTM is up to 3.5× faster than the SpTTM from Tensor Toolbox and 1.5× faster than that from the Cyclops Tensor Framework. Our parallel algorithms show 4.1× speedup on a multicore Intel Core i7 and 18.8× speedup on an NVIDIA K40c GPU, respectively, over our sequential SpTTM.
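For readers skimming the listing, a minimal sketch of the mode-n operation Y = X ×ₙ U may help make the abstract concrete. This is an illustrative COO-based loop, not the paper's optimized kernel; the function name, the dict-of-fibers output, and the (Iₙ × R) transposed storage of U (mentioned in the reference snippets below) are all assumptions.

```python
# A minimal sketch of mode-n sparse tensor-times-dense matrix multiply
# (SpTTM), Y = X x_n U, with X stored in COO form. Illustrative only.
import numpy as np
from collections import defaultdict

def spttm(inds, vals, U, mode):
    """inds: (nnz, N) NumPy integer array of COO indices of sparse X.
    vals: (nnz,) nonzero values. U: dense (I_mode, R) matrix, stored
    transposed so each update reads one contiguous row.
    Returns a dict mapping the remaining indices to dense length-R
    fibers along `mode` (the result is only semi-sparse)."""
    R = U.shape[1]
    fibers = defaultdict(lambda: np.zeros(R))
    for idx, x in zip(inds, vals):
        key = tuple(np.delete(idx, mode))   # indices in all modes but `mode`
        fibers[key] += x * U[idx[mode], :]  # accumulate a scaled matrix row
    return fibers
```

Each surviving fiber along the product mode becomes a dense length-R vector, which is why operating on the tensor directly, rather than matricizing it first, pays off.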
Citations
Journal ArticleDOI
12 Oct 2017
TL;DR: taco, as introduced in this paper, is a C++ library that automatically generates kernels for compound tensor algebra operations on dense and sparse tensors, with applications in machine learning, data analytics, engineering, and the physical sciences.
Abstract: Tensor algebra is a powerful tool with applications in machine learning, data analytics, engineering and the physical sciences. Tensors are often sparse and compound operations must frequently be computed in a single kernel for performance and to save memory. Programmers are left to write kernels for every operation of interest, with different mixes of dense and sparse tensors in different formats. The combinations are infinite, which makes it impossible to manually implement and optimize them all. This paper introduces the first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors. The technique is implemented in a C++ library called taco. Its performance is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.
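To make "compound operations in a single kernel" concrete: the taco paper's running examples include expressions such as A(i,j) = B(i,j,k) * c(k). Below is a hand-written Python sketch of the kind of fused loop taco generates automatically; the function name and COO layout are illustrative assumptions, not taco's actual output or API.

```python
# Illustrative fused kernel: A(i,j) = sum_k B(i,j,k) * c(k), computed in
# a single pass over the nonzeros of a sparse 3-way COO tensor B, with
# no dense intermediate ever materialized.
import numpy as np
from collections import defaultdict

def fused_ttv(B_inds, B_vals, c):
    """B_inds: (nnz, 3) NumPy array of COO indices of B.
    B_vals: (nnz,) nonzero values. c: dense vector.
    Returns the sparse result A as a dict {(i, j): value}."""
    A = defaultdict(float)
    for (i, j, k), b in zip(B_inds, B_vals):
        A[(i, j)] += b * c[k]   # fuse the multiply with the reduction over k
    return A
```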

240 citations

Journal ArticleDOI
TL;DR: A thorough evaluation of five of the latest modern GPU interconnects, across six high-end servers and HPC platforms, shows that for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance.
Abstract: High-performance multi-GPU computing has become an inevitable trend due to the ever-increasing demand for computational capability in emerging domains such as deep learning, big data, and planet-scale simulations. However, the lack of a deep understanding of how modern GPUs can be connected, and of the real impact of state-of-the-art interconnect technology on multi-GPU application performance, remains a hurdle. In this paper, we fill the gap by conducting a thorough evaluation of five of the latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI, and NVSwitch, on six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity, and routing, while one is caused by a PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can have a considerable impact on GPU communication efficiency, as well as on the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling, and migration in a shared environment (e.g., AI clouds and HPC centers), as well as for communication-oriented performance tuning.

118 citations


Cites background from "Optimizing sparse tensor times matr..."

  • ...ence on GPU analytic modeling [38], [39], [40] and performance optimization [41], [42], [43], [44], [45], [46], [47], [48],...


Proceedings ArticleDOI
11 Nov 2018
TL;DR: This paper evaluates HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm.
Abstract: This paper proposes a new storage format for sparse tensors, called Hierarchical COOrdinate (HiCOO; pronounced: "haiku"). It derives from the coordinate (COO) format, arguably the de facto standard for general sparse tensor storage. HiCOO improves upon COO by compressing the indices in units of sparse tensor blocks, with the goals of preserving the "mode-agnostic" simplicity of COO while reducing the bytes needed to represent the tensor and promoting data locality. We evaluate HiCOO by implementing a single-node, multicore-parallel version of the matricized tensor-times-Khatri-Rao product (MTTKRP) operation, which is the most expensive computational core in the widely used CANDECOMP/PARAFAC decomposition (CPD) algorithm. This MTTKRP implementation achieves up to 23.0× (6.8× on average) speedup over the COO format and up to 15.6× (3.1× on average) speedup over another state-of-the-art format, compressed sparse fiber (CSF), while using less than or comparable storage to them. When used within CPD, we also observe speedups against COO- and CSF-based implementations.
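A rough sketch of the HiCOO idea described above: group COO nonzeros into small cubical blocks, then split each index into a shared block coordinate plus a narrow in-block offset so most index bytes are stored once per block. The names, block size, and dict-of-blocks layout below are illustrative assumptions, not the paper's actual data structure.

```python
# Sketch of HiCOO-style index compression: wide block coordinates shared
# per block, 8-bit offsets per nonzero within the block.
import numpy as np
from collections import defaultdict

def to_hicoo(inds, vals, B=128):
    """inds: (nnz, N) NumPy array of COO indices; vals: (nnz,) values;
    B: block edge length (a power of two keeps the split cheap and lets
    offsets fit in uint8). Returns {block_coords: (offsets, values)}."""
    blocks = defaultdict(lambda: ([], []))
    for idx, v in zip(inds, vals):
        bidx = tuple(idx // B)                   # block coordinates (wide ints)
        offs, vs = blocks[bidx]
        offs.append((idx % B).astype(np.uint8))  # narrow in-block offsets
        vs.append(v)
    return {b: (np.array(o), np.array(v)) for b, (o, v) in blocks.items()}
```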

83 citations


Cites background from "Optimizing sparse tensor times matr..."

  • ...Some tensor formats are proposed for structural sparse tensors or specific tensor operations, such as “mode-generic and mode-specific” [20] and semi-COOrdinate (sCOO) [21, 46] formats for the ones with dense modes, Extended Karnaugh Map Representation (EKMR) for some other tensor operations....



Proceedings Article
01 Jan 2018
TL;DR: Two randomized algorithms for low-rank Tucker decomposition of tensors, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order.
Abstract: We propose two randomized algorithms for low-rank Tucker decomposition of tensors. The algorithms, which incorporate sketching, only require a single pass of the input tensor and can handle tensors whose elements are streamed in any order. To the best of our knowledge, ours are the only algorithms which can do this. We test our algorithms on sparse synthetic data and compare them to multiple other methods. We also apply one of our algorithms to a real dense 38 GB tensor representing a video and use the resulting decomposition to correctly classify frames containing disturbances.
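As a simplified illustration of single-pass sketching (not the paper's actual TensorSketch-based algorithms): maintain a small sketch Yₙ = X₍ₙ₎ Ωₙ of every mode-n unfolding, regenerating the needed random row of Ωₙ on the fly from a hash of the column index, so that elements can stream in any order and each is touched exactly once. Everything below is an illustrative assumption.

```python
# Single-pass, order-independent sketching of all mode-n unfoldings of a
# streamed sparse tensor. The random row of Omega_n for a given column is
# rederived deterministically from a hash, so no Omega_n is ever stored.
import numpy as np

def streaming_mode_sketches(stream, dims, k, seed=0):
    """stream: iterable of (index_tuple, value) pairs; dims: tensor shape;
    k: sketch size. Returns a list of (dims[n], k) sketch matrices."""
    N = len(dims)
    Y = [np.zeros((dims[n], k)) for n in range(N)]
    for idx, x in stream:
        for n in range(N):
            rest = idx[:n] + idx[n+1:]          # column index in X_(n)
            rng = np.random.default_rng(hash((seed, n, rest)) % 2**32)
            Y[n][idx[n], :] += x * rng.standard_normal(k)  # row of Omega_n
    return Y
```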

68 citations


Cites background from "Optimizing sparse tensor times matr..."

  • ...[24] Jiajia Li, Yuchen Ma, Chenggang Yan, and Richard Vuduc....


  • ...Other papers that use memory efficient and distributed methods include [4, 23, 24, 25, 32, 18, 1, 19, 28]....


References
Journal ArticleDOI
TL;DR: This survey provides an overview of higher-order tensor decompositions, their applications, and available software.
Abstract: This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or N-way array. Decompositions of higher-order tensors (i.e., N-way arrays with N ≥ 3) have applications in psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and elsewhere. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decomposition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal component analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2, as well as nonnegative variants of all of the above. The N-way Toolbox, Tensor Toolbox, and Multilinear Engine are examples of software packages for working with tensors.
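A tiny worked example of the two decompositions the survey singles out, reconstructing a 3-way tensor from its factors with NumPy (the shapes and variable names are illustrative):

```python
# CP reconstructs X as a sum of R rank-one tensors; Tucker multiplies a
# small core G by a factor matrix along every mode.
import numpy as np

I, J, K, R = 4, 5, 6, 3
A, B, C = (np.random.rand(d, R) for d in (I, J, K))
G = np.random.rand(R, R, R)

# CP/PARAFAC: X[i,j,k] = sum_r A[i,r] * B[j,r] * C[k,r]
X_cp = np.einsum('ir,jr,kr->ijk', A, B, C)

# Tucker: X[i,j,k] = sum_{p,q,r} G[p,q,r] * A[i,p] * B[j,q] * C[k,r]
X_tucker = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)
```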

9,227 citations


"Optimizing sparse tensor times matr..." refers background or methods in this paper

  • ...In this section, we test our algorithms on two platforms and also compare the performance with state-of-the-art Tensor Toolbox library [13]....


  • ...Several examples and definitions are drawn from the overview by Kolda and Bader [13]....


  • ...Since our SpTTM algorithm is a main kernel for low-rank decomposition, the rank R is usually a number smaller than 100 [13]....


  • ...Different from [13], we use the transposed form of the matrix U for efficient TTM in the row-major storage pattern of the C language....


  • ...) The speed of some of the most popular tensor decompositions, including the so-called Tucker decomposition [13], depends critically on having a fast SpTTM, thereby motivating this study....


Proceedings Article
11 Jul 2010
TL;DR: This work proposes an approach and a set of design principles for an intelligent computer agent that runs forever and describes a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs.
Abstract: We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.

2,010 citations


"Optimizing sparse tensor times matr..." refers background or methods in this paper

  • ...We use sparse tensors from real applications including functional Magnetic Resonance Imaging (fMRI) measurements of brain activity [23] (“brainq” with noun-voxel-human), Never Ending Language Learning (NELL) project [22] (“nell1” and “nell2” with noun-verb-noun), and data crawled from tagging systems [24] (“deli” with user-item-tag)....


  • ...For example the density of ‘nell2’ tensor from Never Ending Language Learning (NELL) project [22] is 2....


Journal ArticleDOI
30 May 2008-Science
TL;DR: A computational model is presented that predicts the functional magnetic resonance imaging (fMRI) neural activation associated with words for which fMRI data are not yet available, trained with a combination of data from a trillion-word text corpus and observed fMRI data associated with viewing several dozen concrete nouns.
Abstract: The question of how the human brain represents conceptual knowledge has been debated in many scientific fields. Brain imaging studies have shown that different spatial patterns of neural activation are associated with thinking about different semantic categories of pictures and words (for example, tools, buildings, and animals). We present a computational model that predicts the functional magnetic resonance imaging (fMRI) neural activation associated with words for which fMRI data are not yet available. This model is trained with a combination of data from a trillion-word text corpus and observed fMRI data associated with viewing several dozen concrete nouns. Once trained, the model predicts fMRI activation for thousands of other concrete nouns in the text corpus, with highly significant accuracies over the 60 nouns for which we currently have fMRI data.

1,204 citations


"Optimizing sparse tensor times matr..." refers methods in this paper

  • ...We use sparse tensors from real applications including functional Magnetic Resonance Imaging (fMRI) measurements of brain activity [23] (“brainq” with noun-voxel-human), Never Ending Language Learning (NELL) project [22] (“nell1” and “nell2” with noun-verb-noun), and data crawled from tagging systems [24] (“deli” with user-item-tag)....


Posted Content
TL;DR: A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices; this implies a robust and computationally tractable estimation approach for several popular latent variable models.
Abstract: This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
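The core step the paper analyzes is the tensor power iteration v ← T(I, v, v) / ‖T(I, v, v)‖. A bare-bones NumPy version is sketched below; deflation, random restarts, and the robustness machinery are omitted, and the function name and interface are illustrative rather than the paper's exact procedure.

```python
# One run of the plain tensor power iteration on a symmetric 3-way
# tensor T, converging to one component of an orthogonal decomposition.
import numpy as np

def tensor_power_iteration(T, iters=100, seed=0):
    """T: symmetric (d, d, d) array. Returns (eigenvector, eigenvalue)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = np.einsum('ijk,j,k->i', T, v, v)   # w = T(I, v, v)
        v = w / np.linalg.norm(w)              # renormalize each step
    lam = np.einsum('ijk,i,j,k->', T, v, v, v) # lambda = T(v, v, v)
    return v, lam
```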

842 citations


"Optimizing sparse tensor times matr..." refers background in this paper

  • ...Such applications arise in numerous domains, including neuroscience [1, 2], healthcare analytics [3–5], natural language processing [6], signal processing [7], machine learning [8, 9], and social network analytics [10]....

