To cite this version:
Oguz Kaya, Bora Uçar. Scalable sparse tensor decompositions in distributed memory systems. International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Nov 2015, Austin, TX, United States. doi:10.1145/2807591.2807624. hal-01148202v2.

Scalable Sparse Tensor Decompositions
in Distributed Memory Systems
[Technical Paper]
Oguz Kaya
INRIA and LIP (UMR 5668 CNRS, ENS Lyon,
UCB Lyon 1, Inria) ENS Lyon, France
oguz.kaya@ens-lyon.fr
Bora Uçar
CNRS and LIP (UMR 5668 CNRS, ENS Lyon,
UCB Lyon 1, Inria) ENS Lyon, France
bora.ucar@ens-lyon.fr
ABSTRACT
We investigate an efficient parallelization of the most common iterative sparse tensor decomposition algorithms on distributed memory systems. A key operation in each iteration of these algorithms is the matricized tensor times Khatri-Rao product (MTTKRP). This operation amounts to element-wise vector multiplication and reduction depending on the sparsity of the tensor. We investigate a fine-grain and a coarse-grain task definition for this operation, and propose hypergraph partitioning-based methods for these task definitions to achieve load balance as well as to reduce the communication requirements. We also design a distributed memory sparse tensor library, HyperTensor, which implements a well-known algorithm for the CANDECOMP/PARAFAC (CP) tensor decomposition using the task definitions and the associated partitioning methods. We use this library to test the proposed implementation of MTTKRP in the CP decomposition context, and report scalability results up to 1024 MPI ranks. We observed up to 194-fold speedups using 512 MPI processes on a well-known real-world dataset, and significantly better performance with respect to a state-of-the-art implementation.
Categories and Subject Descriptors
G.1.0 [Numerical Analysis]: General—Numerical algorithms, Parallel algorithms; G.2.2 [Discrete Mathematics]: Graph Theory—Hypergraphs; G.4 [Mathematical Software]: Algorithm design and analysis, Parallel and vector implementations
1. INTRODUCTION
Tensors, or multi-dimensional arrays, arise in many fields, including the analysis of Web graphs [25], knowledge bases [9], product reviews at online stores [7], chemometrics [3], signal processing [16], computer vision [34], and more. Tensor decomposition algorithms are an important tool for understanding tensors and gleaning hidden or latent information from them. Considerable effort is being put into designing numerical algorithms for different tensor decomposition problems (see a short [15] and a long survey [26]), and algorithmic and software contributions go hand in hand with these efforts [2, 6, 14, 23, 31].
One of the most common tensor decomposition approaches is the CANDECOMP/PARAFAC (CP) formulation, which approximates a given tensor as a sum of rank-one tensors. The two most common methods for computing a CP decomposition are (i) CP-ALS [10, 20], which is based on the alternating least squares method; and (ii) CP-OPT [1], which is based on the gradient descent method. Both of these methods are iterative, where the computational core of each iteration is a special operation called the matricized tensor times Khatri-Rao product (MTTKRP). When the input tensor is sparse and $N$-dimensional, the MTTKRP operation amounts to element-wise multiplication of $N-1$ vectors and a scaled reduction of those products according to the sparsity structure of the tensor. This computationally involved operation has received recent interest for efficient execution in different settings such as Matlab [2, 6], MapReduce [23], shared memory [31], and distributed memory [14].
We investigate an efficient parallelization of the MTTKRP operation in distributed memory environments for sparse tensors in the context of the CP decomposition methods CP-ALS and CP-OPT. For this purpose, we formulate two task definitions, a coarse-grain and a fine-grain one. These definitions are given by applying the owner-computes rule to a coarse- and a fine-grain partition of the tensor nonzeros. We define the coarse-grain partition of a tensor as a partition of one of its dimensions. In matrix terms, a coarse-grain partition corresponds to a row-wise or a column-wise partition. Two very recent parallel algorithms, DFacTo [14] and SPLATT [31], have coarse-grain tensor partitions and hence have coarse-grain tasks. We define the fine-grain partition of a tensor as a partition of its nonzeros. This has the same significance in matrix terms. Based on these two task granularities, we present two parallel algorithms for the MTTKRP operation. We address the computational load balance and communication cost reduction problems for the two algorithms, and present hypergraph partitioning-based models to tackle these problems with off-the-shelf partitioning tools.
The MTTKRP operation also arises in close variants of CP-ALS and CP-OPT for some other tensor decomposition methods [28]; hence, it has been implemented as a standalone routine [5] to enable algorithm development. Once this operation is done efficiently, the other parts of the decomposition algorithms are usually straightforward. For this reason, most of the related work on high performance tensor decomposition algorithms focuses on this particular operation. As hinted above, the majority of our contribution is also on the efficiency of the MTTKRP operation. Nonetheless, we design a library for the parallel CP-ALS algorithm to test the proposed MTTKRP algorithms in a suitable context and give experimental results using this library.
The organization of the rest of the paper is as follows. In the next section, we give the notation, describe the MTTKRP operation and the CP-ALS method, and review the related work that influenced our efforts. Hypergraph-theoretical definitions are also given in this section. In Section 3, we describe the coarse- and fine-grain MTTKRP algorithms, analyze their efficient parallelization requirements, and present hypergraph models for reducing the parallelization overhead. Section 4 contains experimental results, where we report speedups and perform comparisons with a state-of-the-art distributed memory CP-ALS implementation.
2. BACKGROUND AND NOTATION
Bold, upper case Roman letters are used for matrices, as in $\mathbf{A}$. Matrix elements are shown with the corresponding lowercase letters, as in $a_{i,j}$. Matrix sizes are sometimes shown in the lower right corner, e.g., $\mathbf{A}_{I \times J}$. Matlab notation is used to refer to entire rows and columns of a matrix, e.g., $\mathbf{A}(i,:)$ and $\mathbf{A}(:,j)$ refer to the $i$th row and $j$th column of $\mathbf{A}$, respectively.
2.1 Tensors
We use calligraphic font to refer to tensors, e.g., $\mathcal{X}$. The order of a tensor is the number of its dimensions, which we denote with $N$. For the sake of simplicity of the notation and the discussion, we describe all the notation and the algorithms for $N = 3$, even though our algorithms and implementations have no such restriction. We explicitly generalize the discussion to general order-$N$ tensors whenever we find it necessary. As in matrices, an element of a tensor is denoted by a lowercase letter and subscripts corresponding to the indices of the element, e.g., the element $(i,j,k)$ of a third-order tensor is $x_{i,j,k}$. A fiber in a tensor is defined by fixing every index but one, e.g., if $\mathcal{X}$ is a third-order tensor, $\mathcal{X}_{:,j,k}$ is a mode-1 fiber and $\mathcal{X}_{i,j,:}$ is a mode-3 fiber. A slice in a tensor is defined by fixing only one index, e.g., $\mathcal{X}_{i,:,:}$ refers to the $i$th slice of $\mathcal{X}$ in mode 1. We use $|\mathcal{X}_{i,:,:}|$ to denote the number of nonzeros in $\mathcal{X}_{i,:,:}$.
Tensors can be matricized in any mode. This is achieved by identifying a subset of the modes of a given tensor $\mathcal{X}$ as the rows and the other modes of $\mathcal{X}$ as the columns of a matrix, and appropriately mapping the elements of $\mathcal{X}$ to those of the resulting matrix. We will be exclusively dealing with matricizations of tensors along a single mode. For example, take $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$. Then $\mathbf{X}_{(1)}$ denotes the mode-1 matricization of $\mathcal{X}$, in such a way that the rows of $\mathbf{X}_{(1)}$ correspond to the first mode of $\mathcal{X}$ and the columns correspond to the remaining modes. The tensor element $x_{i_1,\ldots,i_N}$ corresponds to the element
$$\Big(i_1,\; 1 + \sum_{j=2}^{N}\big[(i_j - 1)\textstyle\prod_{k=2}^{j-1} I_k\big]\Big)$$
of $\mathbf{X}_{(1)}$. Specifically, each column of the matrix $\mathbf{X}_{(1)}$ becomes a mode-1 fiber of the tensor $\mathcal{X}$. Matricizations in the other modes are defined similarly.
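To make the index arithmetic above concrete, the following Python sketch converts the coordinates of nonzeros into their (row, column) positions in the mode-1 unfolding. The function name, the 0-based indexing, and the helper itself are our own illustrative choices, not code from the paper.

```python
import numpy as np

def mode1_unfold_coords(indices, dims):
    """Map 0-based COO indices of an order-N tensor to (row, column) positions
    in the mode-1 unfolding X_(1) (illustrative helper, not from the paper).
    indices : (nnz, N) integer array of nonzero coordinates
    dims    : (I_1, ..., I_N) tensor dimensions
    """
    indices = np.asarray(indices)
    rows = indices[:, 0]                          # the mode-1 index gives the row
    strides = np.cumprod([1] + list(dims[1:-1]))  # column strides 1, I_2, I_2*I_3, ...
    cols = indices[:, 1:] @ strides               # 0-based column index
    return rows, cols

# a 2 x 3 x 4 tensor: entry (i, j, k) = (1, 2, 3) lands in row 1, column 2 + 3*3 = 11
print(mode1_unfold_coords([[1, 2, 3]], (2, 3, 4)))
```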
Given two matrices $\mathbf{A}_{I_1 \times J_1}$ and $\mathbf{B}_{I_2 \times J_2}$, the Kronecker product is defined as
$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{1,1}\mathbf{B} & \cdots & a_{1,J_1}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I_1,1}\mathbf{B} & \cdots & a_{I_1,J_1}\mathbf{B} \end{bmatrix}.$$
For $\mathbf{A}_{I_1 \times J}$ and $\mathbf{B}_{I_2 \times J}$, the Khatri-Rao product is defined as
$$\mathbf{A} \odot \mathbf{B} = \big[\mathbf{A}(:,1) \otimes \mathbf{B}(:,1) \;\; \cdots \;\; \mathbf{A}(:,J) \otimes \mathbf{B}(:,J)\big],$$
which is of size $I_1 I_2 \times J$.
For $\mathbf{A}_{I \times J}$ and $\mathbf{B}_{I \times J}$, the Hadamard product is defined as
$$\mathbf{A} \ast \mathbf{B} = \begin{bmatrix} a_{1,1}b_{1,1} & \cdots & a_{1,J}b_{1,J} \\ \vdots & \ddots & \vdots \\ a_{I,1}b_{I,1} & \cdots & a_{I,J}b_{I,J} \end{bmatrix}.$$
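The three products can be checked on small dense matrices; the NumPy snippet below only illustrates the definitions above (the paper never forms these products explicitly for sparse data).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))                 # I1 x J1 (here J1 = J = 3)
B = rng.random((5, 3))                 # I2 x J2

kron = np.kron(A, B)                   # Kronecker product, (I1*I2) x (J1*J2)

# Khatri-Rao product: column-wise Kronecker product, size (I1*I2) x J
khatri_rao = np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(3)])

# Hadamard product: element-wise product of same-sized matrices
H1, H2 = rng.random((4, 3)), rng.random((4, 3))
hadamard = H1 * H2
```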
The CP decomposition of rank $R$ (or with $R$ components) of a given tensor $\mathcal{X}$ factorizes $\mathcal{X}$ into a sum of $R$ rank-one tensors. For $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, it yields
$$x_{i,j,k} \approx \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr} \quad\text{and}\quad \mathcal{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,$$
for $\mathbf{a}_r \in \mathbb{R}^{I}$, $\mathbf{b}_r \in \mathbb{R}^{J}$, and $\mathbf{c}_r \in \mathbb{R}^{K}$, where $\circ$ is the outer product of the vectors. Here the matrices $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_R]$, $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_R]$, and $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_R]$ are called the factor matrices, or factors. For $N$-mode tensors, we use $\mathbf{U}_1, \ldots, \mathbf{U}_N$ to refer to the factor matrices.
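For intuition, the rank-$R$ model can be evaluated entry-wise on small dense data. The einsum call below is an illustrative check of the formula $x_{i,j,k} \approx \sum_r a_{ir} b_{jr} c_{kr}$, not code from the paper.

```python
import numpy as np

I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(1)
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

# sum of R rank-one tensors a_r (outer) b_r (outer) c_r
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)

# entry-wise, X_hat[i, j, k] equals sum_r a_{ir} b_{jr} c_{kr}
assert np.isclose(X_hat[1, 2, 3], np.sum(A[1, :] * B[2, :] * C[3, :]))
```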
We are now equipped with the notation to present the Alternating Least Squares (ALS) method for obtaining a rank-$R$ approximation of a tensor $\mathcal{X}$ with the CP decomposition. A common formulation of CP-ALS is shown in Algorithm 1 for third-order tensors. At each iteration, each factor matrix is recomputed while fixing the other two, e.g., $\mathbf{A} \leftarrow \mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})(\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$. This operation is performed in the following order: $\mathbf{M}_{\mathbf{A}} = \mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})$, $\mathbf{V} = (\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$, and then $\mathbf{A} \leftarrow \mathbf{M}_{\mathbf{A}}\mathbf{V}$. Here $\mathbf{V}$ is a dense matrix of size $R \times R$ and is easy to compute. The important issue is the efficient computation of the MTTKRP operations yielding $\mathbf{M}_{\mathbf{A}}$, and similarly $\mathbf{M}_{\mathbf{B}} = \mathbf{X}_{(2)}(\mathbf{A} \odot \mathbf{C})$ and $\mathbf{M}_{\mathbf{C}} = \mathbf{X}_{(3)}(\mathbf{B} \odot \mathbf{A})$.
The sheer size of the Khatri-Rao products makes them impossible to compute explicitly; hence, efficient MTTKRP algorithms find other means to carry out the MTTKRP operation (see the next subsection).
Algorithm 1: CP-ALS for third-order tensors
  Input : $\mathcal{X}$: a third-order tensor; $R$: the rank of approximation
  Output: CP decomposition $[\![\lambda; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$
  repeat
      $\mathbf{A} \leftarrow \mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})(\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$
      Normalize columns of $\mathbf{A}$
      $\mathbf{B} \leftarrow \mathbf{X}_{(2)}(\mathbf{C} \odot \mathbf{A})(\mathbf{A}^T\mathbf{A} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$
      Normalize columns of $\mathbf{B}$
      $\mathbf{C} \leftarrow \mathbf{X}_{(3)}(\mathbf{B} \odot \mathbf{A})(\mathbf{A}^T\mathbf{A} \ast \mathbf{B}^T\mathbf{B})^{\dagger}$
      Normalize columns of $\mathbf{C}$ and store the norms as $\lambda$
  until no improvement or maximum iterations reached
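A minimal dense NumPy transcription of Algorithm 1 is given below for reference. It is illustrative only: the paper's setting is sparse and distributed, where the Khatri-Rao products are never formed explicitly and the MTTKRPs (the einsum lines below) are the operations to be parallelized.

```python
import numpy as np

def cp_als(X, R, iters=20, seed=0):
    """Dense sketch of Algorithm 1 for a 3rd-order numpy array X (illustrative only)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
    for _ in range(iters):
        # A <- X_(1)(C ⊙ B)(B^T B * C^T C)^†, with the MTTKRP M_A done via einsum
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        A /= np.linalg.norm(A, axis=0)
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        B /= np.linalg.norm(B, axis=0)
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        lam = np.linalg.norm(C, axis=0)          # store the norms as lambda
        C /= lam
    return lam, A, B, C
```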
2.2 Related work
SPLATT [31] is an efficient implementation of the MTTKRP operation for sparse tensors on shared memory systems. It is our understanding that the code is implemented for 3-mode tensors; there are no experimental results with higher order tensors. Their discussion includes the generalization of the techniques to higher order tensors whenever relevant. SPLATT implements the MTTKRP operation based on the slices of the dimension in which the factor is updated, e.g., on the mode-1 slices when computing $\mathbf{A} \leftarrow \mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})(\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$. Nonzeros of the fibers in a slice are multiplied with the corresponding rows of $\mathbf{B}$, and the results are accumulated to be later scaled with the corresponding row of $\mathbf{C}$ to compute the row of $\mathbf{A}$ corresponding to the slice. Parallelization is done using OpenMP directives, and load balance (in terms of the number of nonzeros in the slices of the mode for which the MTTKRP is computed) is achieved by using the dynamic scheduling policy. Hypergraph models are used to optimize cache performance by reducing the number of times a row of $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ is accessed. Smith et al. [31] also use $N$-partite graphs (where $N$ is the order of the tensor) to reorder the tensors for all dimensions. Experiments are conducted on an HP ProLiant BL280c G6 server with dual 8-core E5-2670 Xeon processors running at 2.6 GHz. Smith et al. implement a sparse tensor-vector product algorithm called TVec for carrying out the MTTKRP operation and report speedups with respect to this algorithm. They report a 3.7x speedup for the serial execution and a 29.8x speedup for the 16-way parallel execution of SPLATT with respect to TVec.
GigaTensor [23] is an implementation of CP-ALS that follows the MapReduce paradigm. All important steps (the MTTKRPs and the computations of $\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C}$) are performed using this paradigm. A distinct advantage of GigaTensor is that, thanks to MapReduce, the issues of fault tolerance, load balance, and out-of-core tensor data are automatically handled. On a real-world dataset, speedup studies with up to 100 machines (each machine has two quad-core Intel 2.83 GHz CPUs) are presented, where the speedup with 100 machines is 1.4 times the speedup with 25 machines. One iteration of CP-ALS as implemented in GigaTensor takes more than $10^3$ seconds for a random tensor of size $10^5 \times 10^5 \times 10^5$ with $10^5/50$ nonzeros on 35 machines, eventually reaching between $10^4$ and $10^5$ seconds for $10^9 \times 10^9 \times 10^9$ with $10^9/50$ nonzeros. The presentation [23] of GigaTensor focuses on three-mode tensors and expresses the map and the reduce functions for this case. To the best of our understanding, additional map and reduce functions are needed for higher order tensors, which would incur overheads.
DFacTo [14] is a distributed memory implementation of the MTTKRP operation. It performs two successive sparse matrix-vector multiplies (SpMVs) to compute a column of the product $\mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})$. A crucial observation (made also elsewhere [26]) is that this operation can be implemented as $\mathbf{X}_{(2)}^T\mathbf{B}(:,r)$, which can be reshaped into a matrix to be multiplied with $\mathbf{C}(:,r)$ to form the $r$th column of the MTTKRP. Although SpMV is a well-investigated operation, there is a peculiarity here: the result of the first SpMV forms the values of the sparse matrix used in the second one. Therefore, there are sophisticated data dependencies between the two SpMVs. Notice that DFacTo is rich in SpMV operations: there are two SpMVs per factorization rank per dimension of the input tensor. DFacTo needs to store the tensor matricized in all dimensions, i.e., $\mathbf{X}_{(1)}, \ldots, \mathbf{X}_{(N)}$. In low dimensions, this can be a slight memory overhead; yet in higher dimensions the overhead can be non-negligible. DFacTo uses MPI for parallelization, yet fully stores the factor matrices in all MPI ranks. The rows of $\mathbf{X}_{(1)}, \ldots, \mathbf{X}_{(N)}$ are distributed blockwise (statically). With this partition, each process computes the corresponding rows of the factor matrices. Finally, DFacTo performs an MPI_Allgatherv operation to communicate the new results to all processes, which results in $(I_n/P)\log_2 P$ communication volume per process (assuming a hypercube algorithm) when computing the $n$th factor matrix having $I_n$ rows using $P$ processes. Experiments are presented on machines equipped with two 2.1 GHz 12-core AMD 6172 processors, where up to 32 machines are used. In sequential runs, DFacTo is shown to be 5 times faster than GigaTensor and 10 times faster than a MATLAB implementation [5]. On a real-world dataset, DFacTo obtains about 3.5x speedup on 32 machines with respect to an execution on four machines.
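The two-SpMV observation can be verified on a small example. The sketch below uses a dense unfolding converted to scipy.sparse for clarity and checks the result against a direct MTTKRP; it mirrors the idea described above rather than DFacTo's actual implementation.

```python
import numpy as np
import scipy.sparse as sp

def mttkrp_two_spmv(X, B, C):
    """Column-by-column MTTKRP via two SpMV-like steps, in the spirit of the
    observation above (DFacTo itself operates on sparse matricized tensors)."""
    I, J, K = X.shape
    R = B.shape[1]
    # mode-2 unfolding: rows indexed by j, columns by the (i, k) pairs
    X2 = sp.csr_matrix(np.transpose(X, (1, 0, 2)).reshape(J, I * K))
    M_A = np.empty((I, R))
    for r in range(R):
        y = X2.T @ B[:, r]          # first SpMV: a vector of length I*K
        Y = y.reshape(I, K)         # reinterpret as an I x K matrix
        M_A[:, r] = Y @ C[:, r]     # multiply with C(:, r) -> rth column of M_A
    return M_A

# sanity check against a direct MTTKRP on a small dense tensor
rng = np.random.default_rng(2)
X, B, C = rng.random((3, 4, 5)), rng.random((4, 2)), rng.random((5, 2))
assert np.allclose(mttkrp_two_spmv(X, B, C), np.einsum('ijk,jr,kr->ir', X, B, C))
```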
Tensor Toolbox [6] is a MATLAB toolbox for handling tensors. It provides many essential operations and enables fast and efficient realizations of complex algorithms in MATLAB for sparse tensors [5]. Among those operations, MTTKRP implementations are provided and used in the CP-ALS method. Here, each column of the output is computed by performing $N-1$ sparse tensor-vector multiplications. Another well-known MATLAB toolbox is the N-way toolbox [2], which is essentially for dense tensors and now incorporates support for sparse tensors [1] through Tensor Toolbox. Tensor Toolbox and the related software provide excellent means for rapid prototyping of algorithms and also efficient programs for tensor operations that can be handled within MATLAB.
2.3 Hypergraphs and hypergraph partitioning
A hypergraph $H = (V, E)$ is defined as a set of vertices $V$ and a set of hyperedges $E$. Each hyperedge is a set of vertices. The vertices of a hypergraph can be associated with weights, denoted by $w[\cdot]$, and the hyperedges can be associated with costs, denoted by $c[\cdot]$. For a given integer $K \ge 2$, a $K$-way vertex partition of a hypergraph $H = (V, E)$ is denoted as $\Pi = \{V_1, \ldots, V_K\}$, where the parts are non-empty; mutually exclusive, $V_k \cap V_\ell = \emptyset$ for $k \ne \ell$; and collectively exhaustive, $V = \bigcup_k V_k$.
Let $W_k = \sum_{v \in V_k} w[v]$ be the total weight in $V_k$ and $W_{avg} = \sum_{v \in V} w[v]/K$ be the average part weight. If each part $V_k \in \Pi$ satisfies the balance criterion
$$W_k \le W_{avg}(1 + \varepsilon), \quad \text{for } k = 1, 2, \ldots, K, \qquad (1)$$
we say that $\Pi$ is balanced, where $\varepsilon$ represents the maximum allowed imbalance ratio.
In a partition $\Pi$, a hyperedge that has at least one vertex in a part is said to connect that part. The number of parts connected by a hyperedge $h$, i.e., its connectivity, is denoted as $\lambda_h$. Given a vertex partition $\Pi$ of a hypergraph $H = (V, E)$, one can measure the size of the cut induced by $\Pi$ as
$$\chi(\Pi) = \sum_{h \in E} c[h](\lambda_h - 1). \qquad (2)$$
This cut measure is called the connectivity-1 cutsize metric. Given $\varepsilon > 0$ and an integer $K > 1$, the standard hypergraph partitioning problem is defined as the task of finding a balanced partition $\Pi$ with $K$ parts such that $\chi(\Pi)$ is minimized. The hypergraph partitioning problem is NP-hard [27].
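The connectivity-1 metric of Equation (2) is straightforward to compute for a given partition. The short Python function below, with a hypothetical dictionary-based hypergraph representation of our own choosing, makes the definition concrete.

```python
def connectivity_cutsize(hyperedges, part, cost=None):
    """Connectivity-1 cutsize chi(Pi) = sum_h c[h] * (lambda_h - 1), where
    lambda_h is the number of parts connected by hyperedge h.
    hyperedges : dict mapping a hyperedge id to an iterable of vertices
    part       : dict mapping a vertex to its part id
    cost       : optional dict mapping a hyperedge id to c[h] (default 1)
    """
    cut = 0
    for h, verts in hyperedges.items():
        lam = len({part[v] for v in verts})            # connectivity lambda_h
        cut += (1 if cost is None else cost[h]) * (lam - 1)
    return cut

# hyperedge 'n2' spans two parts and contributes 1; 'n1' is internal to part 0
print(connectivity_cutsize({'n1': [0, 1], 'n2': [1, 2, 3]},
                           {0: 0, 1: 0, 2: 1, 3: 1}))   # -> 1
```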
A recent variant of the above problem is multi-constraint hypergraph partitioning [13, 24]. In this variant, each vertex has an associated vector of weights. The partitioning objective is the same as above, and the partitioning constraint is to satisfy a balancing constraint for each weight. Let $w[v, i]$ denote the $C$ weights of a vertex $v$ for $i = 1, \ldots, C$. In this variant, the balance criterion (1) is rewritten as
$$W_{k,i} \le W_{avg,i}(1 + \varepsilon) \quad \text{for } k = 1, \ldots, K \text{ and } i = 1, \ldots, C, \qquad (3)$$
where the $i$th weight $W_{k,i}$ of a part $V_k$ is defined as the sum of the $i$th weights of the vertices in that part (i.e., $W_{k,i} = \sum_{v \in V_k} w[v, i]$), $W_{avg,i}$ is the average part weight for the $i$th weight of all vertices (i.e., $W_{avg,i} = \sum_{v \in V} w[v, i]/K$), and $\varepsilon$ again represents the allowed imbalance ratio.
3. PARALLELIZATION
A common approach in implementing the MTTKRP is to explicitly matricize a tensor across all modes, and then perform the Khatri-Rao product using the matricized tensors [14, 31]. Matricizing a tensor in a mode $i$ requires column index values up to $\prod_{k \ne i} I_k$, which can exceed the integer limits supported by modern architectures when using tensors of higher order and very large dimensions. Also, matricizing across all modes results in $N$ replications of a tensor, which can exceed the memory limitations. Hence, in order to be able to handle large tensors, we store them in coordinate format for the MTTKRP operation, which is also the method of choice in Tensor Toolbox [6].
With a tensor stored in the coordinate format, the MTTKRP operation can be performed as shown in Algorithm 2. As seen on Line 1 of this algorithm, a row of $\mathbf{B}$ and a row of $\mathbf{C}$ are retrieved, and their Hadamard product is computed and scaled with a tensor entry to update a row of $\mathbf{M}_{\mathbf{A}}$. In general, for an $N$-mode tensor,
$$\mathbf{M}_{\mathbf{U}_1}(i_1,:) \leftarrow \mathbf{M}_{\mathbf{U}_1}(i_1,:) + x_{i_1,i_2,\ldots,i_N}\big[\mathbf{U}_2(i_2,:) \ast \cdots \ast \mathbf{U}_N(i_N,:)\big]$$
is computed. Here, the indices of the corresponding rows of the factor matrices and of $\mathbf{M}_{\mathbf{U}_1}$ coincide with the indices of the unique tensor entry of the operation.
Algorithm 2: MTTKRP for third-order tensors
  Input : $\mathcal{X}$: tensor; $\mathbf{B}, \mathbf{C}$: factor matrices in all modes except the first; $I_A$: number of rows of the factor $\mathbf{A}$; $R$: rank of the factors
  Output: $\mathbf{M}_{\mathbf{A}} = \mathbf{X}_{(1)}(\mathbf{B} \odot \mathbf{C})$
  Initialize $\mathbf{M}_{\mathbf{A}}$ to zeros of size $I_A \times R$
  foreach $x_{i,j,k} \in \mathcal{X}$ do
1:    $\mathbf{M}_{\mathbf{A}}(i,:) \leftarrow \mathbf{M}_{\mathbf{A}}(i,:) + x_{i,j,k}\,[\mathbf{B}(j,:) \ast \mathbf{C}(k,:)]$
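A serial NumPy sketch of Algorithm 2 on a coordinate-format tensor is shown below; the scatter-add mirrors Line 1 of the algorithm. This is our own illustrative transcription, not the HyperTensor implementation.

```python
import numpy as np

def mttkrp_coo(coords, vals, B, C, num_rows):
    """Serial MTTKRP for a 3rd-order tensor in coordinate format (Algorithm 2):
    for every nonzero x_{i,j,k}, add x_{i,j,k} * [B(j,:) * C(k,:)] to M_A(i,:).
    coords : (nnz, 3) array of 0-based (i, j, k) indices
    vals   : (nnz,) array of nonzero values
    """
    coords = np.asarray(coords)
    vals = np.asarray(vals, dtype=float)
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    contrib = vals[:, None] * (B[j, :] * C[k, :])   # scaled Hadamard products of rows
    M_A = np.zeros((num_rows, B.shape[1]))
    np.add.at(M_A, i, contrib)                      # scatter-add into the rows of M_A
    return M_A
```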
As the factor matrices are accessed row-wise, we define computational units in terms of the rows of factor matrices. It follows naturally to partition all factor matrices row-wise and to use the same partition for the MTTKRP operation in each mode of an input tensor across all CP-ALS iterations, to prevent extra communication. A crucial issue is the task definition, as this pertains to the issues of load balancing and communication. We identify a coarse-grain and a fine-grain task definition for this computational kernel.
In the coarse-grain task definition, the $i$th atomic task consists of computing the row $\mathbf{M}_{\mathbf{A}}(i,:)$ using the nonzeros in the tensor slice $\mathcal{X}_{i,:,:}$ and the rows of $\mathbf{B}$ and $\mathbf{C}$ corresponding to the nonzeros in that slice. The input tensor $\mathcal{X}$ does not change throughout the iterations of the tensor decomposition algorithms; hence it is viable to make the whole slice $\mathcal{X}_{i,:,:}$ available to the process holding $\mathbf{M}_{\mathbf{A}}(i,:)$, so that the MTTKRP operation can be performed by only communicating the rows of $\mathbf{B}$ and $\mathbf{C}$. Yet, as CP-ALS requires the MTTKRP in all modes, and each nonzero $x_{i,j,k}$ belongs to the slices $\mathcal{X}_{i,:,:}$, $\mathcal{X}_{:,j,:}$, and $\mathcal{X}_{:,:,k}$, we need to replicate tensor entries in the owner processes of these slices. This may require up to $N$ times replication of the tensor, depending on its partitioning. Note that an explicit matricization always requires exactly $N$ replications of the tensor entries.
In the fine-grain task definition, an atomic task corresponds to the multiplication of a tensor entry with the Hadamard product of the corresponding rows of $\mathbf{B}$ and $\mathbf{C}$. Here, tensor nonzeros are partitioned among processes with no replication, to induce a task partition by following the owner-computes rule. This necessitates communicating the rows of $\mathbf{B}$ and $\mathbf{C}$ that are needed by these atomic tasks. Furthermore, partial results on the rows of $\mathbf{M}_{\mathbf{A}}$ need to be communicated, as without duplicating tensor entries, we cannot in general compute all contributions to a row of $\mathbf{M}_{\mathbf{A}}$. Here, the partition of $\mathcal{X}$ should be useful in all modes, as the CP-ALS method requires the MTTKRP in all modes.
The coarse-grain task definition resembles the one-dimensional (1D) row-wise (or column-wise) partitioning of sparse matrices, whereas the fine-grain one resembles the two-dimensional (nonzero-based) partitioning of sparse matrices for parallel sparse matrix-vector multiply (SpMV) operations. As is confirmed for SpMV in modern applications, 1D partitioning usually leads to harder problems of load balancing and communication cost reduction. The same phenomenon is likely to be observed in tensors as well. Nonetheless, we cover the coarse-grain task definition, as it is used in the state-of-the-art parallel MTTKRP methods [14, 31], which partition the matricized tensor row-wise (or, equivalently, partition the input tensor by slices).
3.1 Coarse-grain task model
In the coarse-grain task model, computing the rows of $\mathbf{M}_{\mathbf{A}}$, $\mathbf{M}_{\mathbf{B}}$, and $\mathbf{M}_{\mathbf{C}}$ are defined as the atomic tasks, which are partitioned across all processes. Let $\mu_A$ denote the partition of the first mode's indices among the processes, i.e., $\mu_A(i) = p$ if the process $p$ is responsible for computing $\mathbf{M}_{\mathbf{A}}(i,:)$. Similarly, let $\mu_B$ and $\mu_C$ define the partition of the second and the third mode indices. The process owning $\mathbf{M}_{\mathbf{A}}(i,:)$ needs the entire tensor slice $\mathcal{X}_{i,:,:}$; similarly, the process owning $\mathbf{M}_{\mathbf{B}}(j,:)$ needs $\mathcal{X}_{:,j,:}$, and the owner of $\mathbf{M}_{\mathbf{C}}(k,:)$ needs $\mathcal{X}_{:,:,k}$. This necessitates the duplication of some tensor nonzeros to prevent unnecessary communication.
One needs to take the context of CP-ALS into account when parallelizing the MTTKRP method. First, the output $\mathbf{M}_{\mathbf{A}}$ of the MTTKRP is transformed into $\mathbf{A}$. Since $\mathbf{A}(i,:)$ is computed simply by multiplying $\mathbf{M}_{\mathbf{A}}(i,:)$ with the matrix $(\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$, we make the process which owns $\mathbf{M}_{\mathbf{A}}(i,:)$ responsible for computing $\mathbf{A}(i,:)$. Second, $N$ MTTKRP operations follow one another in an iteration. Assuming that every process has the required rows of the factor matrices while executing the MTTKRP for the first mode, it is advisable to implement the MTTKRP in such a way that its output $\mathbf{M}_{\mathbf{A}}$, after being transformed into $\mathbf{A}$, is communicated. This way, all processes would have the necessary data for executing the MTTKRP for the next mode. With these in mind, the coarse-grain parallel MTTKRP method executes Algorithm 3 at each process $p$.
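As a simple baseline for the coarse-grain model, one can assign mode-1 slices to processes by greedily balancing the nonzero counts $|\mathcal{X}_{i,:,:}|$. The sketch below is our own illustration of that baseline only; it addresses the load-balance criterion, whereas the hypergraph-based methods of this section also reduce the communication volume.

```python
import heapq
from collections import Counter

def assign_slices_by_nnz(mode1_indices, P):
    """Greedily assign mode-1 slices to P processes so that per-process nonzero
    counts |X_{i,:,:}| are balanced (a baseline, not the paper's hypergraph method).
    mode1_indices : iterable with the mode-1 index of every nonzero
    Returns a dict: slice index -> owning process.
    """
    nnz_per_slice = Counter(mode1_indices)
    heap = [(0, p) for p in range(P)]          # (current load, process id)
    heapq.heapify(heap)
    owner = {}
    # heaviest slices first, each to the currently least-loaded process
    for i, nnz in sorted(nnz_per_slice.items(), key=lambda item: -item[1]):
        load, p = heapq.heappop(heap)
        owner[i] = p
        heapq.heappush(heap, (load + nnz, p))
    return owner
```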

References
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., San Francisco, 1979.
T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.
R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.