Technical Report
Department of Computer Science
and Engineering
University of Minnesota
4-192 Keller Hall
200 Union Street SE
Minneapolis, MN 55455-0159 USA
TR 15-008
SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication
Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, George Karypis
May 13, 2015


SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication
Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, George Karypis
University of Minnesota, Minneapolis, MN 55455, U.S.A.
{shaden, karypis}@cs.umn.edu, {ravi0022, nikos}@umn.edu
Abstract—Multi-dimensional arrays, or tensors, are increas-
ingly found in fields such as signal processing and recommender
systems. Real-world tensors can be enormous in size and often
very sparse. There is a need for efficient, high-performance
tools capable of processing the massive sparse tensors of today
and the future. This paper introduces SPLATT, a C library with
shared-memory parallelism for three-mode tensors. SPLATT
contains algorithmic improvements over competing state-of-the-art
tools for sparse tensor factorization. SPLATT has a fast,
parallel method of multiplying a matricized tensor by a Khatri-
Rao product, which is a key kernel in tensor factorization
methods. SPLATT uses a novel data structure that exploits the
sparsity patterns of tensors. This data structure has a small
memory footprint similar to competing methods and allows
for the computational improvements featured in our work. We
also present a method of finding cache-friendly reorderings
and utilizing them with a novel form of cache tiling. To our
knowledge, this is the first work to investigate reordering
and cache tiling in this context. SPLATT averages almost 30×
speedup compared to our baseline when using 16 threads and
reaches over 80× speedup on NELL-2.
Keywords-Sparse tensors, PARAFAC, CANDECOMP, CPD,
parallel
I. INTRODUCTION
Many application domains give rise to multi-way data that
can be naturally represented via tensors. For example, in
the context of a user content tagging system, e.g., Flickr
(http://www.flickr.com), Delicious (http://www.delicious.com), and
Mendeley (http://www.mendeley.com), the set of tags that a user in the
system provides to a piece of content is naturally represented
via a three-mode tensor of user-item-tag triplets. Similarly,
the three-way subject-verb-object relations that are being
extracted by the Never Ending Language Learning (NELL)
project [1] are represented via a three-mode tensor of noun-
verb-noun triplets.
This increased applicability of tensors has led to the
expanding use of tensor-based analysis techniques. The
Canonical Polyadic Decomposition (CPD) is one of the most
commonly used factorizations and has seen use in psycho-
metrics [2], signal processing [3], recommender systems [4],
and other fields. CPD, described in Section II-B, attempts
to decompose a tensor into a set of rank-one tensors. Such
a decomposition has numerous applications. For example,
in the context of the tagging system, it can be used to
recommend a set of tags to a user for a particular item or a
set of items given a user and their previous tags. Similarly in
the context of NELL it can be used for noun-phrase concept
discovery and to detect contextual synonyms [5].
Though the methods for computing CPD are well un-
derstood in the context of dense tensors, most recent ap-
plications of tensor decomposition involve tensors that are
extremely large and very sparse. For example, NELL has
dimensions in the tens of millions and over one-hundred mil-
lion nonzero entries. Existing approaches for dense tensors
cannot be applied to sparse datasets because their memory
consumption scales with the tensor dimensions instead of
the number of nonzero entries. To address this need, various
approaches that deal with sparse CPD have been proposed in
recent years. The Tensor Toolbox [6] is a widely used MAT-
LAB software package and uses an efficient algorithm that is
not hindered by extreme sparsity. However, Tensor Toolbox’s
algorithm cannot easily be parallelized and as such it cannot
leverage the multiple cores in today’s multiprocessors. On
the other hand, GigaTensor [5] uses an algorithm that is
explicitly designed for large-scale parallelism but requires
more floating-point operations (FLOPs) than other methods.
This paper introduces SPLATT, a C library for operating
on three-mode tensors. Our contributions are three-fold:
1) SPLATT contains algorithmic improvements over the
state-of-the-art tools for factoring sparse tensors.
SPLATT has a fast, parallel method of multiplying
a matricized tensor by a Khatri-Rao product, a key
kernel in tensor factorizations.
2) SPLATT uses a novel data structure that is able to
exploit the sparsity patterns of tensors. This data
structure has a small memory footprint and allows for
the computational improvements featured in our work.
3) We present a method of finding cache-friendly reorder-
ings and utilize them with a novel form of cache tiling.
To our knowledge, this is the first work to investigate
reordering and cache tiling in this context.
We evaluate SPLATT across several large datasets of vary-
ing properties and demonstrate speedup over other compet-
ing methods on each one. We evaluate our method for cache-
friendly reordering and tiling by comparing against random
orderings of our datasets. Finally, we also demonstrate near-
linear scaling of our parallel algorithm.

II. PRELIMINARIES
In this section we provide a brief background on tensors
and their notation. We then describe the Canonical Polyadic
Decomposition, a widely used tensor factorization. For more
information on tensors and their factorizations, we direct the
reader to the essential survey by Kolda and Bader [7]. For
a thorough discussion on implementation details of tensor
computations in MATLAB, see [8].
A. Tensor Notation
Tensors are the generalization of matrices to higher di-
mensions. The dimensions occupied by the tensor are called
modes. We can also say a tensor with n modes is of order
n. For example, a tensor of order three takes the shape
of a box and a tensor of order two would simply be a
matrix. In this work we focus on third-order tensors because
they are the simplest to reason about and visualize. They
also have the added advantage of minimizing clutter due
to added indexing. However, we stress that all methods
presented in this work are easily extended to work with
higher-order tensors. The simplest and perhaps most popular
data structure for representing sparse tensors is a list of
(i, j, k, v) coordinates.
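For concreteness, such a coordinate (COO) representation can be sketched in C as a set of parallel index and value arrays (a minimal illustration with our own field names, not SPLATT's actual data structure):

```c
#include <stdint.h>
#include <stdlib.h>

/* Coordinate (COO) storage for a sparse three-mode tensor:
 * nonzero z has indices (i[z], j[z], k[z]) and value v[z]. */
typedef struct {
  uint64_t nnz;        /* number of nonzeros (m)            */
  uint64_t dims[3];    /* I, J, K                           */
  uint32_t *i, *j, *k; /* per-nonzero indices, one per mode */
  double   *v;         /* per-nonzero values                */
} coo_tensor;

static coo_tensor *coo_alloc(uint64_t nnz, uint64_t I, uint64_t J, uint64_t K) {
  coo_tensor *t = malloc(sizeof(*t));
  t->nnz = nnz;
  t->dims[0] = I; t->dims[1] = J; t->dims[2] = K;
  t->i = malloc(nnz * sizeof(*t->i));
  t->j = malloc(nnz * sizeof(*t->j));
  t->k = malloc(nnz * sizeof(*t->k));
  t->v = malloc(nnz * sizeof(*t->v));
  return t;
}
```

Keeping the indices and values in separate arrays lets the per-mode index arrays be streamed through independently, which the kernels below rely on.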
In this work we denote tensors as X and matrices as
A. We write the element in coordinate (i, j, k) of X as
X (i, j, k). Unless otherwise stated, the sparse tensor X is of
dimension I×J×K and has m nonzeros. We use the colon
notation of MATLAB, in which a colon in the place of an
index represents all members of that mode. For example,
A(:, r) is column r of the matrix A.
Fibers are a building block of tensors. Fibers are the
result of holding all but one index constant. The fibers of a
matrix are its rows and columns. In a third-order tensor, the
fibers along the third mode are referred to as tubes. Two possible fibers are
X (i, j, :) and X (i, :, k). A slice of a tensor is the result of
holding all but two indices constant. The result is a matrix
and two possible slices are X (i, :, :) and X (:, j, :).
Two essential operations on matrices used in the CPD
are the Hadamard product and the Khatri-Rao product.
The Hadamard product, denoted A ∗ B, is the element-wise
multiplication of A and B. The element (i, j) of A ∗ B is
A(i, j)B(i, j). A and B must match in dimension for the
Hadamard product to exist. The Khatri-Rao product, denoted
A ⊙ B, is defined in terms of the Kronecker product ⊗:

$$A \odot B = [\,a_1 \otimes b_1,\; a_2 \otimes b_2,\; \ldots,\; a_n \otimes b_n\,].$$

A and B must have matching column dimension for their
Khatri-Rao product to be defined. If A is I×J and B is
M×J, then A ⊙ B is IM×J. Figure 1 illustrates the Khatri-
Rao product of two small matrices.
$$B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}, \qquad
C = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \\ c_{31} & c_{32} \end{bmatrix}, \qquad
C \odot B = \begin{bmatrix}
c_{11}b_{11} & c_{12}b_{12} \\
c_{11}b_{21} & c_{12}b_{22} \\
c_{21}b_{11} & c_{22}b_{12} \\
c_{21}b_{21} & c_{22}b_{22} \\
c_{31}b_{11} & c_{32}b_{12} \\
c_{31}b_{21} & c_{32}b_{22}
\end{bmatrix}$$

Figure 1: The Khatri-Rao product of two matrices.

A tensor can be matricized, or unfolded, into a matrix
along any of its modes. In the mode-n matricization, the
mode-n fibers are used to form the columns of the resulting
matrix. The mode-n unfolding of X is denoted as X_(n). If X
is of dimension I×J×K, then X_(1) is of dimension I×JK.
Figure 2 demonstrates the unfolding of a small tensor.
$$\mathcal{X}(:,:,1) = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \qquad
\mathcal{X}(:,:,2) = \begin{bmatrix} 7 & 8 & 9 \\ 10 & 11 & 12 \end{bmatrix}$$

$$X_{(1)} = \begin{bmatrix} 1 & 2 & 3 & 7 & 8 & 9 \\ 4 & 5 & 6 & 10 & 11 & 12 \end{bmatrix}, \qquad
X_{(2)} = \begin{bmatrix} 1 & 4 & 7 & 10 \\ 2 & 5 & 8 & 11 \\ 3 & 6 & 9 & 12 \end{bmatrix}, \qquad
X_{(3)} = \begin{bmatrix} 1 & 4 & 2 & 5 & 3 & 6 \\ 7 & 10 & 8 & 11 & 9 & 12 \end{bmatrix}$$

Figure 2: The matricizations of a (2×3×2) tensor.
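The unfoldings above need never be formed explicitly: a coordinate (i, j, k) maps directly to a position in each matricization. A small sketch of the mode-1 mapping in C (the function names are ours, for illustration):

```c
#include <stdint.h>

/* Mode-1 unfolding: X(i, j, k) lands at row i, column j + k*J of X_(1).
 * E.g., in Figure 2 (J = 3), X(0, 0, 1) = 7 lands at column 0 + 1*3 = 3. */
static uint64_t mode1_col(uint64_t j, uint64_t k, uint64_t J) {
  return j + k * J;
}

/* Inverse map: recover (j, k) from a mode-1 column index y. */
static void mode1_col_inv(uint64_t y, uint64_t J, uint64_t *j, uint64_t *k) {
  *j = y % J; /* mode-2 index */
  *k = y / J; /* mode-3 index */
}
```

The inverse map reappears in Section III-B, where GigaTensor uses y % J and y / J to address rows of B and C.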
B. Canonical Polyadic Decomposition
CPD is an extension of the Singular Value Decomposition
(SVD) to tensors. In the SVD, a matrix M is decomposed
into the summation of F rank-one matrices, where F can
either be the rank of M or some smaller integer if a low-
rank approximation is desired. The SVD is most commonly
written in terms of three matrices, M = UΣVᵀ, where U
and V are unitary, Σ is a diagonal matrix of scaling factors,
and the ith rank-one matrix is the outer product of uᵢ and vᵢ.
Often, Σ is absorbed by scaling A, and M is instead written
as M = ABᵀ.
CPD extends this concept to factor a tensor into the
summation of F rank-one tensors. A rank-one tensor of
order n is the outer product of n vectors. Determining the
exact rank of a tensor is NP-hard [9] and we are almost
always interested in F ≪ max(I, J, K) for sparse tensors.
When computing the rank-F CPD of a third-order tensor,
we wish to find factor matrices A ∈ ℝ^{I×F}, B ∈ ℝ^{J×F},
and C ∈ ℝ^{K×F}. A, B, and C are typically dense regardless
of the sparsity of X. The matricizations of X can also be
defined in terms of its CPD:

$$X_{(1)} \approx A\,(C \odot B)^\top, \qquad X_{(2)} \approx B\,(C \odot A)^\top, \qquad X_{(3)} \approx C\,(B \odot A)^\top.$$
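Equivalently, the rank-F model can be stated elementwise (a standard identity, added here for clarity):

$$\mathcal{X}(i, j, k) \approx \sum_{f=1}^{F} A(i, f)\, B(j, f)\, C(k, f).$$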
The method of Alternating Least Squares (ALS) is the
most commonly used algorithm for computing the CPD. In
each iteration we first fix B and C and solve for Â via

$$\hat{A} = \arg\min_{\hat{A}} \left\lVert X_{(1)} - \hat{A}\,(C \odot B)^\top \right\rVert_F^2.$$

The least squares problem is minimized by

$$\hat{A} = X_{(1)}(C \odot B)\,(C^\top C * B^\top B)^\dagger,$$

where M† is the pseudo-inverse of M. (CᵀC ∗ BᵀB) is an
F×F matrix, so computing its pseudo-inverse is a minor
computation relative to X_(1)(C ⊙ B). Once Â is computed, B̂
and Ĉ are then solved for similarly. The process is repeated
until convergence.
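Spelling out "solved for similarly", one full ALS iteration performs the three analogous updates (our expansion of the update rule above, using the matricization identities from Section II-B):

$$\hat{A} = X_{(1)}(C \odot B)\,(C^\top C * B^\top B)^\dagger, \quad
\hat{B} = X_{(2)}(C \odot A)\,(C^\top C * A^\top A)^\dagger, \quad
\hat{C} = X_{(3)}(B \odot A)\,(B^\top B * A^\top A)^\dagger.$$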
C. Matricized Tensor Times Khatri-Rao Product
We denote M = X_(1)(C ⊙ B) as MTTKRP (matricized
tensor times Khatri-Rao product). MTTKRP is executed once
per mode per iteration of ALS. For simplicity of notation
and space, we only write MTTKRP in terms of operating on
the first mode. We define M to have I rows and B and C
to have J and K rows, respectively. M, B, and C all have
F columns.
MTTKRP is often the bottleneck of computing the CPD.
Even though M is only an I×F matrix, C ⊙ B is a
dense JK×F matrix which can occupy significantly more
memory than X. The size and the cost of forming C ⊙ B
are prohibitive for all but the smallest sparse tensors. An
efficient MTTKRP implementation is essential for large-scale
tensor operations, and C ⊙ B cannot be explicitly formed in
practice.
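To see why, a rough back-of-the-envelope estimate (our numbers, using the "tens of millions" dimensions quoted for NELL in Section I): with J = K = 10^7 and a modest rank F = 16,

$$JK \cdot F = 10^{7} \cdot 10^{7} \cdot 16 = 1.6 \times 10^{15}\ \text{entries} \approx 12.8\ \text{PB at 8 bytes each},$$

whereas the tensor itself has only on the order of 10^8 nonzeros.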
III. RELATED WORK
Over the years a number of approaches have been devel-
oped for computing the MTTKRP. The most efficient of these
methods operate in O(m) time, but differ in implementation
difficulty, cache utilization, and opportunities for parallelism.
A. Sparse Tensor-Vector Products
Each column of M is a linear combination of the fibers
of X with columns of B and C. MTTKRP can be formulated
as a series of F tensor-vector products [8]. Using tensor-
vector products is the chosen method for several major
MATLAB implementations such as Tensor Toolbox [6] and
Tensorlab [10].
A three-mode tensor requires two tensor-vector products
per column of M. A temporary array t of size m is used to
“stretch” the vectors B(:, f) and C(:, f) to map to nonzeros
of X . For each column f, the two tensor-vector products are
performed at once and stored within t. Once t is filled, we
need to “shrink” it to a vector of length I. This essentially
sums all of the entries of t that correspond to nonzeros
X (i, :, :) and stores the result as M(i, f). Algorithm 1
presents pseudocode for computing tensor-vector products.
The sparse tensor-vector approach uses 3mF FLOPs
(2mF for the initial products and mF for the accumulation
steps) and m words of intermediate memory for t. Each
Algorithm 1 MTTKRP via sparse tensor-vector products.
Input: nonzeros of X and respective I, J, and K indices
Output: M
for f ← 0 to F do
    for z ← 0 to m do                ▷ Vector products
        t[z] ← vals[z] · B(indJ[z], f) · C(indK[z], f)
    end for
    for z ← 0 to m do                ▷ Accumulate M(:, f)
        M(indI[z], f) ← M(indI[z], f) + t[z]
    end for
end for
nonzero can be processed in parallel during the “stretch”
stage because a nonzero will only modify a single element
of t. The accumulation step does not have this guarantee,
however, and must be executed serially. An advantage of this
method is that X does not require any special data structure
and it can be implemented in just a few lines of code in
MATLAB.
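A direct C rendering of Algorithm 1 over coordinate arrays (a hedged sketch under our own naming; the row-major F-column layouts for B, C, and M are our assumption, the caller zeroes M and supplies the length-m scratch array t):

```c
#include <stdint.h>

/* MTTKRP via sparse tensor-vector products (Algorithm 1). */
void mttkrp_tvp(uint64_t m, uint64_t F,
                const uint32_t *indI, const uint32_t *indJ,
                const uint32_t *indK, const double *vals,
                const double *B, const double *C,
                double *M, double *t) {
  for (uint64_t f = 0; f < F; f++) {
    /* "Stretch": each nonzero writes one element of t -- parallel-safe. */
    for (uint64_t z = 0; z < m; z++)
      t[z] = vals[z] * B[indJ[z] * F + f] * C[indK[z] * F + f];
    /* "Shrink": accumulate into M(:, f); serial, since different
     * nonzeros may update the same row indI[z]. */
    for (uint64_t z = 0; z < m; z++)
      M[indI[z] * F + f] += t[z];
  }
}
```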
For a more in-depth overview of various tensor products,
we refer the reader to the work of Bader and Kolda [8], [7].
B. GigaTensor
GigaTensor [5] is a parallel CPD-ALS algorithm devel-
oped for the MapReduce [11] paradigm. GigaTensor utilizes
the massive parallelism of MapReduce by reformulating
MTTKRP as a series of Hadamard products. There are no
dependencies during a Hadamard product and each element
of the output can be computed in parallel.
GigaTensor avoids the construction of C ⊙ B by separately
computing the contributions of B(:, f) and C(:, f) with X_(1)
via two Hadamard products. After computing the separate
contributions, the results are combined via a third Hadamard
product. The resulting matrix N has the same sparsity pattern
as X_(1) and each nonzero entry N(i, y) is equal to

$$N(i, y) = X_{(1)}(i, y)\; B(y \bmod J,\, f)\; C(\lfloor y/J \rfloor,\, f). \tag{1}$$
After computing the entries of N, the rows of the resulting
matrix are summed to form a column of M. The total process
requires 5mF FLOPs and m + max(J, K) intermediate
memory.
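In coordinate form, one column of M under GigaTensor's reformulation might look like the following serial sketch (our rendering of an inherently data-parallel MapReduce computation; n1 and n2 are length-m scratch arrays and Mcol, of length I, is zeroed by the caller):

```c
#include <stdint.h>

/* Column M(:, f) via GigaTensor's three Hadamard products.
 * A nonzero X(i, j, k) sits at X_(1)(i, y) with y = j + k*J,
 * so B(y % J, f) = B(j, f) and C(y / J, f) = C(k, f). */
void gigatensor_column(uint64_t m, uint64_t F, uint64_t f,
                       const uint32_t *indI, const uint32_t *indJ,
                       const uint32_t *indK, const double *vals,
                       const double *B, const double *C,
                       double *n1, double *n2, double *Mcol) {
  for (uint64_t z = 0; z < m; z++) /* N1: X_(1) against stretched B(:, f) */
    n1[z] = vals[z] * B[indJ[z] * F + f];
  for (uint64_t z = 0; z < m; z++) /* N2: pattern of X_(1) against stretched C(:, f) */
    n2[z] = C[indK[z] * F + f];
  for (uint64_t z = 0; z < m; z++) /* N = N1 .* N2; sum rows into M(:, f) */
    Mcol[indI[z]] += n1[z] * n2[z];
}
```

Every element of n1, n2, and the final products can be computed independently, which is what makes the formulation a good fit for MapReduce.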
C. DFacTo
DFacTo [12] is a recent algorithm designed for distributed
tensor factorization. DFacTo uses an efficient MTTKRP al-
gorithm that is posed as a series of sparse matrix-vector
multiplications (SpMVs). M is computed one column at a
time and each column is formed by two SpMVs. DFacTo
first forms X_(2)ᵀ, an IK×J sparse matrix whose rows consist
of the mode-2 fibers of X. When forming column M(:, f)
we first compute X_(2)ᵀ B(:, f) and store the result in the vals
field of M_r, an I×K sparse matrix. Finally, we compute
the second SpMV, M_r C(:, f), which yields M(:, f).
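The building block here is an ordinary CSR sparse matrix-vector product; a generic sketch (our naming, not DFacTo's code):

```c
#include <stdint.h>

/* y = A*x for an nrows-row CSR matrix A:
 * rowptr[r]..rowptr[r+1] delimit the nonzeros of row r. */
void spmv_csr(uint64_t nrows, const uint64_t *rowptr,
              const uint32_t *colind, const double *vals,
              const double *x, double *y) {
  for (uint64_t r = 0; r < nrows; r++) {
    double sum = 0.0;
    for (uint64_t p = rowptr[r]; p < rowptr[r + 1]; p++)
      sum += vals[p] * x[colind[p]];
    y[r] = sum;
  }
}
```

DFacTo chains two such products per column of M, reusing mature SpMV kernels and distribution strategies.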

References (partial, as indexed)
- MapReduce: simplified data processing on large clusters
- Tensor Decompositions and Applications
- Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition
- Toward an architecture for never-ending language learning
Frequently Asked Questions
Q1. What have the authors contributed in "Splatt: efficient and parallel sparse tensor-matrix multiplication" ?

This paper introduces SPLATT, a C library with shared-memory parallelism for three-mode tensors. This data structure has a small memory footprint similar to competing methods and allows for the computational improvements featured in their work. The authors also present a method of finding cache-friendly reorderings and utilizing them with a novel form of cache tiling. To their knowledge, this is the first work to investigate reordering and cache tiling in this context. 

Their future work includes adapting SPLATT to this model and investigating memory-scalable algorithms for tensor factorization in the context of distributed systems.

The parallel version of SPLATT uses a task decomposition on the rows of M. Since the computation of M(i, :) requires only the nonzeros in slice X (i, :, :), the mode-1 slices of X can be distributed among processes. 

Relabeled mode-2 indices affect spatial locality and allow a fiber and its neighbors to access consecutive rows of B. A clear drawback of a mode-dependent reordering is the need to construct and partition a hypergraph for each mode.

The objective of a mode-independent reordering is to find a single tensor permutation that results in improved execution time regardless of which mode MTTKRP is being performed on. 

Since the authors operate in a shared address space, they are able to evenly distribute the I′ rows of scratch space among threads and do a reduction with only a synchronization at the end.

Dense regions are attractive because they offer increased cache performance while accessing B and C. Consider the execution of SPLATT along the first mode. 

Analogous to rows of a CSR matrix, (P+1) integers are required to store the start indices for the fibers and one integer is used for each fid. 


By storing fibers along the longer mode, the authors are able to minimize the number of stored fibers and increase the average fiber length. 

BrainQ is an interesting dataset because its dimensions are relatively small, resulting in a tensor several orders of magnitude more dense than the other tensors studied in this work.

The authors divide the indices into sets of size K′ and arrange the X(i, :, k) fibers into tubes, each with a maximum of I′ mode-1 indices and K′ mode-3 indices.

$$M(i, :) = \sum_{k=0}^{K} \sum_{j=0}^{J} \mathcal{X}(i, j, k)\,\bigl(B(j, :) * C(k, :)\bigr) \tag{3}$$

$$M(i, :) = \sum_{k=0}^{K} C(k, :) * \sum_{j=0}^{J} \mathcal{X}(i, j, k)\, B(j, :) \tag{4}$$

First, the authors rewrite (1) to operate on a row of M. Next, the authors break the columns of X_(1) into J and K components to arrive at (3).
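Equation (4) is what enables a row-wise kernel: the inner sum over j is accumulated once per fiber and then scaled by a single row of C. A hedged C sketch for one row M(i, :), assuming a fiber-major layout of slice i (fptr, fk, accum, and the other names are ours):

```c
#include <stdint.h>

/* M(i, :) += sum_k C(k, :) .* ( sum_j X(i, j, k) * B(j, :) )  -- eq. (4).
 * Fibers fib_begin..fib_end belong to slice i; fptr delimits each fiber's
 * nonzeros, fk[f] is its mode-3 index k, accum is length-F scratch,
 * and Mi points at row i of M. */
void mttkrp_row(uint64_t fib_begin, uint64_t fib_end, uint64_t F,
                const uint64_t *fptr, const uint32_t *fk,
                const uint32_t *indJ, const double *vals,
                const double *B, const double *C,
                double *accum, double *Mi) {
  for (uint64_t f = fib_begin; f < fib_end; f++) {
    for (uint64_t r = 0; r < F; r++) accum[r] = 0.0;
    for (uint64_t p = fptr[f]; p < fptr[f + 1]; p++) /* inner sum over j */
      for (uint64_t r = 0; r < F; r++)
        accum[r] += vals[p] * B[indJ[p] * F + r];
    const double *Ck = &C[fk[f] * F];                /* Hadamard with C(k, :) */
    for (uint64_t r = 0; r < F; r++)
      Mi[r] += accum[r] * Ck[r];
  }
}
```

Each fiber's nonzeros are touched exactly once, which is the single-traversal property claimed for the algorithm.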

Their algorithm computes entire rows of M at a time and as a result only requires a single traversal of the sparse tensor structure. 

The extremely large dimensions that sparse tensors often exhibit are prohibitive to memory performance, even with a good reordering. 

The number of partitions in which a hyperedge is found (or, its connectivity) exactly models the number of times that its corresponding row in M, B, or C must be fetched from memory.