Technical Report
Department of Computer Science
and Engineering
University of Minnesota
4-192 Keller Hall
200 Union Street SE
Minneapolis, MN 55455-0159 USA
TR 15-008
SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication
Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, George Karypis
May 13, 2015


SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication
Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, George Karypis
University of Minnesota, Minneapolis, MN 55455, U.S.A.
{shaden, karypis}@cs.umn.edu, {ravi0022, nikos}@umn.edu
Abstract—Multi-dimensional arrays, or tensors, are increas-
ingly found in fields such as signal processing and recommender
systems. Real-world tensors can be enormous in size and often
very sparse. There is a need for efficient, high-performance
tools capable of processing the massive sparse tensors of today
and the future. This paper introduces SPLATT, a C library with
shared-memory parallelism for three-mode tensors. SPLATT
contains algorithmic improvements over competing state-of-the-art
tools for sparse tensor factorization. SPLATT has a fast,
parallel method of multiplying a matricized tensor by a Khatri-
Rao product, which is a key kernel in tensor factorization
methods. SPLATT uses a novel data structure that exploits the
sparsity patterns of tensors. This data structure has a small
memory footprint similar to competing methods and allows
for the computational improvements featured in our work. We
also present a method of finding cache-friendly reorderings
and utilizing them with a novel form of cache tiling. To our
knowledge, this is the first work to investigate reordering
and cache tiling in this context. SPLATT averages almost 30×
speedup compared to our baseline when using 16 threads and
reaches over 80× speedup on NELL-2.
Keywords-Sparse tensors, PARAFAC, CANDECOMP, CPD,
parallel
I. INTRODUCTION
Many application domains give rise to multi-way data that
can be naturally represented via tensors. For example, in
the context of a user content tagging system, e.g., Flickr
(http://www.flickr.com), Delicious (http://www.delicious.com), and
Mendeley (http://www.mendeley.com), the set of tags that a user in the
system provides to a piece of content is naturally represented
via a three-mode tensor of user-item-tag triplets. Similarly,
the three-way subject-verb-object relations that are being
extracted by the Never Ending Language Learning (NELL)
project [1] are represented via a three-mode tensor of noun-
verb-noun triplets.
This increased applicability of tensors has led to the
expanding use of tensor-based analysis techniques. The
Canonical Polyadic Decomposition (CPD) is one of the most
commonly used factorizations and has seen use in psycho-
metrics [2], signal processing [3], recommender systems [4],
and other fields. CPD, described in Section II-B, attempts
to decompose a tensor into a set of rank-one tensors. Such
a decomposition has numerous applications. For example,
in the context of the tagging system, it can be used to
recommend a set of tags to a user for a particular item or a
set of items given a user and their previous tags. Similarly in
the context of NELL it can be used for noun-phrase concept
discovery and to detect contextual synonyms [5].
Though the methods for computing CPD are well un-
derstood in the context of dense tensors, most recent ap-
plications of tensor decomposition involve tensors that are
extremely large and very sparse. For example, NELL has
dimensions in the tens of millions and over one-hundred mil-
lion nonzero entries. Existing approaches for dense tensors
cannot be applied to sparse datasets because their memory
consumption scales with the tensor dimensions instead of
the number of nonzero entries. To address this need, various
approaches that deal with sparse CPD have been proposed in
recent years. The Tensor Toolbox [6] is a widely used MAT-
LAB software package and uses an efficient algorithm that is
not hindered by extreme sparsity. However, Tensor Toolbox’s
algorithm cannot easily be parallelized and as such it cannot
leverage the multiple cores in today’s multiprocessors. On
the other hand, GigaTensor [5] uses an algorithm that is
explicitly designed for large-scale parallelism but requires
more floating-point operations (FLOPs) than other methods.
This paper introduces SPLATT, a C library for operating
on three-mode tensors. Our contributions are three-fold:
1) SPLATT contains algorithmic improvements over the
state-of-the-art tools for factoring sparse tensors.
SPLATT has a fast, parallel method of multiplying
a matricized tensor by a Khatri-Rao product, a key
kernel in tensor factorizations.
2) SPLATT uses a novel data structure that is able to
exploit the sparsity patterns of tensors. This data
structure has a small memory footprint and allows for
the computational improvements featured in our work.
3) We present a method of finding cache-friendly reorder-
ings and utilize them with a novel form of cache tiling.
To our knowledge, this is the first work to investigate
reordering and cache tiling in this context.
We evaluate SPLATT across several large datasets of vary-
ing properties and demonstrate speedup over other compet-
ing methods on each one. We evaluate our method for cache-
friendly reordering and tiling by comparing against random
orderings of our datasets. Finally, we also demonstrate near-
linear scaling of our parallel algorithm.

II. PRELIMINARIES
In this section we provide a brief background on tensors
and their notation. We then describe the Canonical Polyadic
Decomposition, a widely used tensor factorization. For more
information on tensors and their factorizations, we direct the
reader to the essential survey by Kolda and Bader [7]. For
a thorough discussion on implementation details of tensor
computations in MATLAB, see [8].
A. Tensor Notation
Tensors are the generalization of matrices to higher di-
mensions. The dimensions occupied by the tensor are called
modes. We can also say a tensor with n modes is of order
n. For example, a tensor of order three takes the shape
of a box and a tensor of order two would simply be a
matrix. In this work we focus on third-order tensors because
they are the simplest to reason about and visualize. They
also have the added advantage of minimizing clutter due
to added indexing. However, we stress that all methods
presented in this work are easily extended to work with
higher-order tensors. The simplest and perhaps most popular
data structure for representing sparse tensors is a list of
(i, j, k, v) coordinates.
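For concreteness, such a coordinate (COO) representation can be sketched in C as a set of parallel index and value arrays (a minimal illustration with our own field names, not SPLATT's actual data structure):

```c
#include <stdint.h>
#include <stdlib.h>

/* Coordinate (COO) storage for a sparse three-mode tensor:
 * nonzero z has indices (i[z], j[z], k[z]) and value v[z]. */
typedef struct {
  uint64_t nnz;        /* number of nonzeros (m)            */
  uint64_t dims[3];    /* I, J, K                           */
  uint32_t *i, *j, *k; /* per-nonzero indices, one per mode */
  double   *v;         /* per-nonzero values                */
} coo_tensor;

static coo_tensor *coo_alloc(uint64_t nnz, uint64_t I, uint64_t J, uint64_t K) {
  coo_tensor *t = malloc(sizeof(*t));
  t->nnz = nnz;
  t->dims[0] = I; t->dims[1] = J; t->dims[2] = K;
  t->i = malloc(nnz * sizeof(*t->i));
  t->j = malloc(nnz * sizeof(*t->j));
  t->k = malloc(nnz * sizeof(*t->k));
  t->v = malloc(nnz * sizeof(*t->v));
  return t;
}
```

Keeping the indices and values in separate arrays lets the per-mode index arrays be streamed through independently, which the kernels below rely on.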
In this work we denote tensors as X and matrices as
A. We write the element in coordinate (i, j, k) of X as
X (i, j, k). Unless otherwise stated, the sparse tensor X is of
dimension I×J×K and has m nonzeros. We use the colon
notation of MATLAB, in which a colon in the place of an
index represents all members of that mode. For example,
A(:, r) is column r of the matrix A.
Fibers are a building block of tensors. Fibers are the
result of holding all but one index constant. The fibers of a
matrix are its rows and columns. In a third-order tensor, the
fibers along the third mode are referred to as tubes. Two possible fibers are
X (i, j, :) and X (i, :, k). A slice of a tensor is the result of
holding all but two indices constant. The result is a matrix
and two possible slices are X (i, :, :) and X (:, j, :).
Two essential operations on matrices used in the CPD
are the Hadamard product and the Khatri-Rao product.
The Hadamard product, denoted A ∗ B, is the element-wise
multiplication of A and B. The element (i, j) of A ∗ B is
A(i, j)B(i, j). A and B must match in dimension for the
Hadamard product to exist. The Khatri-Rao product, denoted
A ⊙ B, is defined in terms of the Kronecker product ⊗:

$$A \odot B = [\,a_1 \otimes b_1,\; a_2 \otimes b_2,\; \ldots,\; a_n \otimes b_n\,].$$

A and B must have matching column dimension for their
Khatri-Rao product to be defined. If A is I×J and B is
M×J, then A ⊙ B is IM×J. Figure 1 illustrates the Khatri-
Rao product of two small matrices.
$$B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}, \qquad
C = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \\ c_{31} & c_{32} \end{bmatrix}, \qquad
C \odot B = \begin{bmatrix}
c_{11}b_{11} & c_{12}b_{12} \\
c_{11}b_{21} & c_{12}b_{22} \\
c_{21}b_{11} & c_{22}b_{12} \\
c_{21}b_{21} & c_{22}b_{22} \\
c_{31}b_{11} & c_{32}b_{12} \\
c_{31}b_{21} & c_{32}b_{22}
\end{bmatrix}$$

Figure 1: The Khatri-Rao product of two matrices.

A tensor can be matricized, or unfolded, into a matrix
along any of its modes. In the mode-n matricization, the
mode-n fibers are used to form the columns of the resulting
matrix. The mode-n unfolding of X is denoted as X_(n). If X
is of dimension I×J×K, then X_(1) is of dimension I×JK.
Figure 2 demonstrates the unfolding of a small tensor.
$$\mathcal{X}(:,:,1) = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \qquad
\mathcal{X}(:,:,2) = \begin{bmatrix} 7 & 8 & 9 \\ 10 & 11 & 12 \end{bmatrix}$$

$$X_{(1)} = \begin{bmatrix} 1 & 2 & 3 & 7 & 8 & 9 \\ 4 & 5 & 6 & 10 & 11 & 12 \end{bmatrix}, \qquad
X_{(2)} = \begin{bmatrix} 1 & 4 & 7 & 10 \\ 2 & 5 & 8 & 11 \\ 3 & 6 & 9 & 12 \end{bmatrix}, \qquad
X_{(3)} = \begin{bmatrix} 1 & 4 & 2 & 5 & 3 & 6 \\ 7 & 10 & 8 & 11 & 9 & 12 \end{bmatrix}$$

Figure 2: The matricizations of a (2×3×2) tensor.
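The unfoldings above need never be formed explicitly: a coordinate (i, j, k) maps directly to a position in each matricization. A small sketch of the mode-1 mapping in C (the function names are ours, for illustration):

```c
#include <stdint.h>

/* Mode-1 unfolding: X(i, j, k) lands at row i, column j + k*J of X_(1).
 * E.g., in Figure 2 (J = 3), X(0, 0, 1) = 7 lands at column 0 + 1*3 = 3. */
static uint64_t mode1_col(uint64_t j, uint64_t k, uint64_t J) {
  return j + k * J;
}

/* Inverse map: recover (j, k) from a mode-1 column index y. */
static void mode1_col_inv(uint64_t y, uint64_t J, uint64_t *j, uint64_t *k) {
  *j = y % J; /* mode-2 index */
  *k = y / J; /* mode-3 index */
}
```

The inverse map reappears in Section III-B, where GigaTensor uses y % J and y / J to address rows of B and C.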
B. Canonical Polyadic Decomposition
CPD is an extension of the Singular Value Decomposition
(SVD) to tensors. In the SVD, a matrix M is decomposed
into the summation of F rank-one matrices, where F can
either be the rank of M or some smaller integer if a low-
rank approximation is desired. The SVD is most commonly
written in terms of three matrices, M = UΣVᵀ, where U
and V are unitary, Σ is a diagonal matrix of scaling factors,
and the ith rank-one matrix is the outer product of uᵢ and vᵢ.
Often, Σ is absorbed by scaling A, and M is instead written
as M = ABᵀ.
CPD extends this concept to factor a tensor into the
summation of F rank-one tensors. A rank-one tensor of
order n is the outer product of n vectors. Determining the
exact rank of a tensor is NP-hard [9] and we are almost
always interested in F ≪ max(I, J, K) for sparse tensors.
When computing the rank-F CPD of a third-order tensor,
we wish to find factor matrices A ∈ ℝ^{I×F}, B ∈ ℝ^{J×F},
and C ∈ ℝ^{K×F}. A, B, and C are typically dense regardless
of the sparsity of X. The matricizations of X can also be
defined in terms of its CPD:

$$X_{(1)} \approx A\,(C \odot B)^\top, \qquad X_{(2)} \approx B\,(C \odot A)^\top, \qquad X_{(3)} \approx C\,(B \odot A)^\top.$$
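Equivalently, the rank-F model can be stated elementwise (a standard identity, added here for clarity):

$$\mathcal{X}(i, j, k) \approx \sum_{f=1}^{F} A(i, f)\, B(j, f)\, C(k, f).$$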
The method of Alternating Least Squares (ALS) is the
most commonly used algorithm for computing the CPD. In
each iteration we first fix B and C and solve for Â via

$$\hat{A} = \arg\min_{\hat{A}} \left\lVert X_{(1)} - \hat{A}\,(C \odot B)^\top \right\rVert_F^2.$$

The least squares problem is minimized by

$$\hat{A} = X_{(1)}(C \odot B)\,(C^\top C * B^\top B)^\dagger,$$

where M† is the pseudo-inverse of M. (CᵀC ∗ BᵀB) is an
F×F matrix, so computing its pseudo-inverse is a minor
computation relative to X_(1)(C ⊙ B). Once Â is computed, B̂
and Ĉ are then solved for similarly. The process is repeated
until convergence.
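Spelling out "solved for similarly", one full ALS iteration performs the three analogous updates (our expansion of the update rule above, using the matricization identities from Section II-B):

$$\hat{A} = X_{(1)}(C \odot B)\,(C^\top C * B^\top B)^\dagger, \quad
\hat{B} = X_{(2)}(C \odot A)\,(C^\top C * A^\top A)^\dagger, \quad
\hat{C} = X_{(3)}(B \odot A)\,(B^\top B * A^\top A)^\dagger.$$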
C. Matricized Tensor Times Khatri-Rao Product
We denote M = X_(1)(C ⊙ B) as MTTKRP (matricized
tensor times Khatri-Rao product). MTTKRP is executed once
per mode per iteration of ALS. For simplicity of notation
and space, we only write MTTKRP in terms of operating on
the first mode. We define M to have I rows and B and C
to have J and K rows, respectively. M, B, and C all have
F columns.
MTTKRP is often the bottleneck of computing the CPD.
Even though M is only an I×F matrix, C ⊙ B is a
dense JK×F matrix which can occupy significantly more
memory than X. The size and the cost of forming C ⊙ B
are prohibitive for all but the smallest sparse tensors. An
efficient MTTKRP implementation is essential for large-scale
tensor operations, and C ⊙ B cannot be explicitly formed in
practice.
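To see why, a rough back-of-the-envelope estimate (our numbers, using the "tens of millions" dimensions quoted for NELL in Section I): with J = K = 10^7 and a modest rank F = 16,

$$JK \cdot F = 10^{7} \cdot 10^{7} \cdot 16 = 1.6 \times 10^{15}\ \text{entries} \approx 12.8\ \text{PB at 8 bytes each},$$

whereas the tensor itself has only on the order of 10^8 nonzeros.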
III. RELATED WORK
Over the years a number of approaches have been devel-
oped for computing the MTTKRP. The most efficient of these
methods operate in O(m) time, but differ in implementation
difficulty, cache utilization, and opportunities for parallelism.
A. Sparse Tensor-Vector Products
Each column of M is a linear combination of the fibers
of X with columns of B and C. MTTKRP can be formulated
as a series of F tensor-vector products [8]. Using tensor-
vector products is the chosen method for several major
MATLAB implementations such as Tensor Toolbox [6] and
Tensorlab [10].
A three-mode tensor requires two tensor-vector products
per column of M. A temporary array t of size m is used to
“stretch” the vectors B(:, f) and C(:, f) to map to nonzeros
of X . For each column f, the two tensor-vector products are
performed at once and stored within t. Once t is filled, we
need to “shrink” it to a vector of length I. This essentially
sums all of the entries of t that correspond to nonzeros
X (i, :, :) and stores the result as M(i, f). Algorithm 1
presents pseudocode for computing tensor-vector products.
The sparse tensor-vector approach uses 3mF FLOPs
(2mF for the initial products and mF for the accumulation
steps) and m words of intermediate memory for t. Each
Algorithm 1 MTTKRP via sparse tensor-vector products.
Input: nonzeros of X and respective I, J, and K indices
Output: M
for f ← 0 to F do
    for z ← 0 to m do                ▷ Vector products
        t[z] ← vals[z] · B(indJ[z], f) · C(indK[z], f)
    end for
    for z ← 0 to m do                ▷ Accumulate M(:, f)
        M(indI[z], f) ← M(indI[z], f) + t[z]
    end for
end for
nonzero can be processed in parallel during the “stretch”
stage because a nonzero will only modify a single element
of t. The accumulation step does not have this guarantee,
however, and must be executed serially. An advantage of this
method is that X does not require any special data structure
and it can be implemented in just a few lines of code in
MATLAB.
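A direct C rendering of Algorithm 1 over coordinate arrays (a hedged sketch under our own naming; the row-major F-column layouts for B, C, and M are our assumption, the caller zeroes M and supplies the length-m scratch array t):

```c
#include <stdint.h>

/* MTTKRP via sparse tensor-vector products (Algorithm 1). */
void mttkrp_tvp(uint64_t m, uint64_t F,
                const uint32_t *indI, const uint32_t *indJ,
                const uint32_t *indK, const double *vals,
                const double *B, const double *C,
                double *M, double *t) {
  for (uint64_t f = 0; f < F; f++) {
    /* "Stretch": each nonzero writes one element of t -- parallel-safe. */
    for (uint64_t z = 0; z < m; z++)
      t[z] = vals[z] * B[indJ[z] * F + f] * C[indK[z] * F + f];
    /* "Shrink": accumulate into M(:, f); serial, since different
     * nonzeros may update the same row indI[z]. */
    for (uint64_t z = 0; z < m; z++)
      M[indI[z] * F + f] += t[z];
  }
}
```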
For a more in-depth overview of various tensor products,
we refer the reader to the work of Bader and Kolda [8], [7].
B. GigaTensor
GigaTensor [5] is a parallel CPD-ALS algorithm devel-
oped for the MapReduce [11] paradigm. GigaTensor utilizes
the massive parallelism of MapReduce by reformulating
MTTKRP as a series of Hadamard products. There are no
dependencies during a Hadamard product and each element
of the output can be computed in parallel.
GigaTensor avoids the construction of C ⊙ B by separately
computing the contributions of B(:, f) and C(:, f) with X_(1)
via two Hadamard products. After computing the separate
contributions, the results are combined via a third Hadamard
product. The resulting matrix N has the same sparsity pattern
as X_(1) and each nonzero entry N(i, y) is equal to

$$N(i, y) = X_{(1)}(i, y)\; B(y \bmod J,\, f)\; C(\lfloor y/J \rfloor,\, f). \tag{1}$$
After computing the entries of N, the rows of the resulting
matrix are summed to form a column of M. The total process
requires 5mF FLOPs and m + max(J, K) intermediate
memory.
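In coordinate form, one column of M under GigaTensor's reformulation might look like the following serial sketch (our rendering of an inherently data-parallel MapReduce computation; n1 and n2 are length-m scratch arrays and Mcol, of length I, is zeroed by the caller):

```c
#include <stdint.h>

/* Column M(:, f) via GigaTensor's three Hadamard products.
 * A nonzero X(i, j, k) sits at X_(1)(i, y) with y = j + k*J,
 * so B(y % J, f) = B(j, f) and C(y / J, f) = C(k, f). */
void gigatensor_column(uint64_t m, uint64_t F, uint64_t f,
                       const uint32_t *indI, const uint32_t *indJ,
                       const uint32_t *indK, const double *vals,
                       const double *B, const double *C,
                       double *n1, double *n2, double *Mcol) {
  for (uint64_t z = 0; z < m; z++) /* N1: X_(1) against stretched B(:, f) */
    n1[z] = vals[z] * B[indJ[z] * F + f];
  for (uint64_t z = 0; z < m; z++) /* N2: pattern of X_(1) against stretched C(:, f) */
    n2[z] = C[indK[z] * F + f];
  for (uint64_t z = 0; z < m; z++) /* N = N1 .* N2; sum rows into M(:, f) */
    Mcol[indI[z]] += n1[z] * n2[z];
}
```

Every element of n1, n2, and the final products can be computed independently, which is what makes the formulation a good fit for MapReduce.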
C. DFacTo
DFacTo [12] is a recent algorithm designed for distributed
tensor factorization. DFacTo uses an efficient MTTKRP al-
gorithm that is posed as a series of sparse matrix-vector
multiplications (SpMVs). M is computed one column at a
time and each column is formed by two SpMVs. DFacTo
first forms X_(2)ᵀ, an IK×J sparse matrix whose rows consist
of the mode-2 fibers of X. When forming column M(:, f)
we first compute X_(2)ᵀ B(:, f) and store the result in the vals
field of M_r, an I×K sparse matrix. Finally, we compute
the second SpMV, M_r C(:, f), which yields M(:, f).
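The building block here is an ordinary CSR sparse matrix-vector product; a generic sketch (our naming, not DFacTo's code):

```c
#include <stdint.h>

/* y = A*x for an nrows-row CSR matrix A:
 * rowptr[r]..rowptr[r+1] delimit the nonzeros of row r. */
void spmv_csr(uint64_t nrows, const uint64_t *rowptr,
              const uint32_t *colind, const double *vals,
              const double *x, double *y) {
  for (uint64_t r = 0; r < nrows; r++) {
    double sum = 0.0;
    for (uint64_t p = rowptr[r]; p < rowptr[r + 1]; p++)
      sum += vals[p] * x[colind[p]];
    y[r] = sum;
  }
}
```

DFacTo chains two such products per column of M, reusing mature SpMV kernels and distribution strategies.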

References (partial, as indexed)
- MapReduce: simplified data processing on large clusters
- Tensor Decompositions and Applications
- Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition
- Toward an architecture for never-ending language learning
Frequently Asked Questions
Q1. What have the authors contributed in "Splatt: efficient and parallel sparse tensor-matrix multiplication" ?

This paper introduces SPLATT, a C library with shared-memory parallelism for three-mode tensors. This data structure has a small memory footprint similar to competing methods and allows for the computational improvements featured in their work. The authors also present a method of finding cache-friendly reorderings and utilizing them with a novel form of cache tiling. To their knowledge, this is the first work to investigate reordering and cache tiling in this context. 

Their future work includes adapting SPLATT to this model and investigating memory-scalable algorithms for tensor factorization in the context of distributed systems.

The parallel version of SPLATT uses a task decomposition on the rows of M. Since the computation of M(i, :) requires only the nonzeros in slice X (i, :, :), the mode-1 slices of X can be distributed among processes. 

Relabeled mode-2 indices affect spatial locality and allow a fiber and its neighbors to access consecutive rows of B. A clear drawback of a mode-dependent reordering is the need to construct and partition a hypergraph for each mode.

The objective of a mode-independent reordering is to find a single tensor permutation that results in improved execution time regardless of which mode MTTKRP is being performed on. 

Since the authors operate in a shared address space, they are able to evenly distribute the I′ rows of scratch space among threads and do a reduction with only a synchronization at the end.

Dense regions are attractive because they offer increased cache performance while accessing B and C. Consider the execution of SPLATT along the first mode. 

Analogous to rows of a CSR matrix, (P+1) integers are required to store the start indices for the fibers and one integer is used for each fid. 


By storing fibers along the longer mode, the authors are able to minimize the number of stored fibers and increase the average fiber length. 

BrainQ is an interesting dataset because its dimensions are relatively small, resulting in a tensor several orders of magnitude more dense than the other tensors studied in this work.

The authors divide the indices into sets of size K′ and arrange the X(i, :, k) fibers into tubes, each with a maximum of I′ mode-1 indices and K′ mode-3 indices.

$$M(i, :) = \sum_{k=0}^{K} \sum_{j=0}^{J} \mathcal{X}(i, j, k)\,\bigl(B(j, :) * C(k, :)\bigr) \tag{3}$$

$$M(i, :) = \sum_{k=0}^{K} C(k, :) * \sum_{j=0}^{J} \mathcal{X}(i, j, k)\, B(j, :) \tag{4}$$

First, the authors rewrite (1) to operate on a row of M. Next, the authors break the columns of X_(1) into J and K components to arrive at (3).
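Equation (4) is what enables a row-wise kernel: the inner sum over j is accumulated once per fiber and then scaled by a single row of C. A hedged C sketch for one row M(i, :), assuming a fiber-major layout of slice i (fptr, fk, accum, and the other names are ours):

```c
#include <stdint.h>

/* M(i, :) += sum_k C(k, :) .* ( sum_j X(i, j, k) * B(j, :) )  -- eq. (4).
 * Fibers fib_begin..fib_end belong to slice i; fptr delimits each fiber's
 * nonzeros, fk[f] is its mode-3 index k, accum is length-F scratch,
 * and Mi points at row i of M. */
void mttkrp_row(uint64_t fib_begin, uint64_t fib_end, uint64_t F,
                const uint64_t *fptr, const uint32_t *fk,
                const uint32_t *indJ, const double *vals,
                const double *B, const double *C,
                double *accum, double *Mi) {
  for (uint64_t f = fib_begin; f < fib_end; f++) {
    for (uint64_t r = 0; r < F; r++) accum[r] = 0.0;
    for (uint64_t p = fptr[f]; p < fptr[f + 1]; p++) /* inner sum over j */
      for (uint64_t r = 0; r < F; r++)
        accum[r] += vals[p] * B[indJ[p] * F + r];
    const double *Ck = &C[fk[f] * F];                /* Hadamard with C(k, :) */
    for (uint64_t r = 0; r < F; r++)
      Mi[r] += accum[r] * Ck[r];
  }
}
```

Each fiber's nonzeros are touched exactly once, which is the single-traversal property claimed for the algorithm.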

Their algorithm computes entire rows of M at a time and as a result only requires a single traversal of the sparse tensor structure. 

The extremely large dimensions that sparse tensors often exhibit are prohibitive to memory performance, even with a good reordering. 

The number of partitions in which a hyperedge is found (or, its connectivity) exactly models the number of times that its corresponding row in M, B, or C must be fetched from memory.