To cite this version:
Oguz Kaya, Bora Uçar. Scalable sparse tensor decompositions in distributed memory systems. International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Nov 2015, Austin, TX, United States. doi:10.1145/2807591.2807624. hal-01148202v2.

Scalable Sparse Tensor Decompositions
in Distributed Memory Systems
[Technical Paper]
Oguz Kaya
INRIA and LIP (UMR 5668 CNRS, ENS Lyon,
UCB Lyon 1, Inria) ENS Lyon, France
oguz.kaya@ens-lyon.fr
Bora Uçar
CNRS and LIP (UMR 5668 CNRS, ENS Lyon,
UCB Lyon 1, Inria) ENS Lyon, France
bora.ucar@ens-lyon.fr
ABSTRACT
We investigate an efficient parallelization of the most common iterative sparse tensor decomposition algorithms on distributed memory systems. A key operation in each iteration of these algorithms is the matricized tensor times Khatri-Rao product (MTTKRP). This operation amounts to element-wise vector multiplication and reduction depending on the sparsity of the tensor. We investigate a fine-grain and a coarse-grain task definition for this operation, and propose hypergraph partitioning-based methods for these task definitions to achieve load balance as well as to reduce the communication requirements. We also design a distributed memory sparse tensor library, HyperTensor, which implements a well-known algorithm for the CANDECOMP/PARAFAC (CP) tensor decomposition using the task definitions and the associated partitioning methods. We use this library to test the proposed implementation of MTTKRP in the CP decomposition context, and report scalability results up to 1024 MPI ranks. We observed up to 194-fold speedups using 512 MPI processes on a well-known real-world dataset, and significantly better performance with respect to a state-of-the-art implementation.
Categories and Subject Descriptors
G.1.0 [Numerical Analysis]: General—Numerical algorithms, Parallel algorithms; G.2.2 [Discrete Mathematics]: Graph Theory—Hypergraphs; G.4 [Mathematical Software]: Algorithm design and analysis, Parallel and vector implementations
1. INTRODUCTION
Tensors, or multi-dimensional arrays, arise in many fields, including the analysis of Web graphs [25], knowledge bases [9], product reviews at online stores [7], chemometrics [3], signal processing [16], computer vision [34], and more. Tensor decomposition algorithms are an important tool for understanding tensors and gleaning hidden or latent information from them. Considerable effort is being put into designing numerical algorithms for different tensor decomposition problems (see a short [15] and a long survey [26]), and algorithmic and software contributions go hand in hand with these efforts [2, 6, 14, 23, 31].
One of the most common tensor decomposition approaches is the CANDECOMP/PARAFAC (CP) formulation, which approximates a given tensor as a sum of rank-one tensors. The two most common methods for computing a CP decomposition are (i) CP-ALS [10, 20], which is based on the alternating least squares method; and (ii) CP-OPT [1], which is based on the gradient descent method. Both of these methods are iterative, where the computational core of each iteration is a special operation called the matricized tensor times Khatri-Rao product (MTTKRP). When the input tensor is sparse and $N$-dimensional, the MTTKRP operation amounts to element-wise multiplication of $N-1$ vectors and a scaled reduction of those products according to the sparsity structure of the tensor. This computationally involved operation has received recent interest for efficient execution in different settings such as Matlab [2, 6], MapReduce [23], shared memory [31], and distributed memory [14].
We investigate an efficient parallelization of the MTTKRP operation in distributed memory environments for sparse tensors in the context of the CP decomposition methods CP-ALS and CP-OPT. For this purpose, we formulate two task definitions, a coarse-grain and a fine-grain one. These definitions are given by applying the owner-computes rule to a coarse- and a fine-grain partition of the tensor nonzeros. We define the coarse-grain partition of a tensor as a partition of one of its dimensions. In matrix terms, a coarse-grain partition corresponds to a row-wise or a column-wise partition. Two very recent parallel algorithms, DFacTo [14] and SPLATT [31], have coarse-grain tensor partitions and hence have coarse-grain tasks. We define the fine-grain partition of a tensor as a partition of its nonzeros. This has the same significance in matrix terms. Based on these two task granularities, we present two parallel algorithms for the MTTKRP operation. We address the computational load balance and communication cost reduction problems for the two algorithms, and present hypergraph partitioning-based models to tackle these problems with off-the-shelf partitioning tools.
The MTTKRP operation also arises in close variants of CP-ALS and CP-OPT for some other tensor decomposition methods [28]; hence, it has been implemented as a standalone routine [5] to enable algorithm development. Once this operation is done efficiently, the other parts of the decomposition algorithms are usually straightforward. For this reason, most of the related work on high performance tensor decomposition algorithms focuses on this particular operation. As hinted above, the majority of our contribution is also on the efficiency of the MTTKRP operation. Nonetheless, we design a library for the parallel CP-ALS algorithm to test the proposed MTTKRP algorithms in a suitable context and give experimental results using this library.
The organization of the rest of the paper is as follows. In the next section, we give the notation, describe the MTTKRP operation and the CP-ALS method, and review the related work that influenced our efforts. Hypergraph-theoretical definitions are also given in this section. In Section 3, we describe the coarse- and fine-grain MTTKRP algorithms, analyze their efficient parallelization requirements, and present hypergraph models for reducing the parallelization overhead. Section 4 contains experimental results, where we report speedups and perform comparisons with a state-of-the-art distributed memory CP-ALS implementation.
2. BACKGROUND AND NOTATION
Bold, upper case Roman letters are used for matrices, as in $\mathbf{A}$. Matrix elements are shown with the corresponding lowercase letters, as in $a_{i,j}$. Matrix sizes are sometimes shown in the lower right corner, e.g., $\mathbf{A}_{I \times J}$. Matlab notation is used to refer to entire rows and columns of a matrix, e.g., $\mathbf{A}(i,:)$ and $\mathbf{A}(:,j)$ refer to the $i$th row and $j$th column of $\mathbf{A}$, respectively.
2.1 Tensors
We use calligraphic font to refer to tensors, e.g., $\mathcal{X}$. The order of a tensor is the number of its dimensions, which we denote with $N$. For the sake of simplicity of the notation and the discussion, we describe all the notation and the algorithms for $N = 3$, even though our algorithms and implementations have no such restriction. We explicitly generalize the discussion to general order-$N$ tensors whenever we find it necessary. As in matrices, an element of a tensor is denoted by a lowercase letter and subscripts corresponding to the indices of the element, e.g., the element $(i,j,k)$ of a third-order tensor is $x_{i,j,k}$. A fiber in a tensor is defined by fixing every index but one, e.g., if $\mathcal{X}$ is a third-order tensor, $\mathcal{X}_{:,j,k}$ is a mode-1 fiber and $\mathcal{X}_{i,j,:}$ is a mode-3 fiber. A slice in a tensor is defined by fixing only one index, e.g., $\mathcal{X}_{i,:,:}$ refers to the $i$th slice of $\mathcal{X}$ in mode 1. We use $|\mathcal{X}_{i,:,:}|$ to denote the number of nonzeros in $\mathcal{X}_{i,:,:}$.
Tensors can be matricized in any mode. This is achieved by identifying a subset of the modes of a given tensor $\mathcal{X}$ as the rows and the other modes of $\mathcal{X}$ as the columns of a matrix, and appropriately mapping the elements of $\mathcal{X}$ to those of the resulting matrix. We will be exclusively dealing with matricizations of tensors along a single mode. For example, take $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$. Then $\mathbf{X}_{(1)}$ denotes the mode-1 matricization of $\mathcal{X}$, in such a way that the rows of $\mathbf{X}_{(1)}$ correspond to the first mode of $\mathcal{X}$ and the columns correspond to the remaining modes. The tensor element $x_{i_1,\ldots,i_N}$ corresponds to the element
$$\Big(i_1,\; 1 + \sum_{j=2}^{N}\big[(i_j - 1)\textstyle\prod_{k=2}^{j-1} I_k\big]\Big)$$
of $\mathbf{X}_{(1)}$. Specifically, each column of the matrix $\mathbf{X}_{(1)}$ becomes a mode-1 fiber of the tensor $\mathcal{X}$. Matricizations in the other modes are defined similarly.
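To make the index arithmetic above concrete, the following Python sketch converts the coordinates of nonzeros into their (row, column) positions in the mode-1 unfolding. The function name, the 0-based indexing, and the helper itself are our own illustrative choices, not code from the paper.

```python
import numpy as np

def mode1_unfold_coords(indices, dims):
    """Map 0-based COO indices of an order-N tensor to (row, column) positions
    in the mode-1 unfolding X_(1) (illustrative helper, not from the paper).
    indices : (nnz, N) integer array of nonzero coordinates
    dims    : (I_1, ..., I_N) tensor dimensions
    """
    indices = np.asarray(indices)
    rows = indices[:, 0]                          # the mode-1 index gives the row
    strides = np.cumprod([1] + list(dims[1:-1]))  # column strides 1, I_2, I_2*I_3, ...
    cols = indices[:, 1:] @ strides               # 0-based column index
    return rows, cols

# a 2 x 3 x 4 tensor: entry (i, j, k) = (1, 2, 3) lands in row 1, column 2 + 3*3 = 11
print(mode1_unfold_coords([[1, 2, 3]], (2, 3, 4)))
```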
Given two matrices $\mathbf{A}_{I_1 \times J_1}$ and $\mathbf{B}_{I_2 \times J_2}$, the Kronecker product is defined as
$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{1,1}\mathbf{B} & \cdots & a_{1,J_1}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{I_1,1}\mathbf{B} & \cdots & a_{I_1,J_1}\mathbf{B} \end{bmatrix}.$$
For $\mathbf{A}_{I_1 \times J}$ and $\mathbf{B}_{I_2 \times J}$, the Khatri-Rao product is defined as
$$\mathbf{A} \odot \mathbf{B} = \big[\mathbf{A}(:,1) \otimes \mathbf{B}(:,1) \;\; \cdots \;\; \mathbf{A}(:,J) \otimes \mathbf{B}(:,J)\big],$$
which is of size $I_1 I_2 \times J$.
For $\mathbf{A}_{I \times J}$ and $\mathbf{B}_{I \times J}$, the Hadamard product is defined as
$$\mathbf{A} \ast \mathbf{B} = \begin{bmatrix} a_{1,1}b_{1,1} & \cdots & a_{1,J}b_{1,J} \\ \vdots & \ddots & \vdots \\ a_{I,1}b_{I,1} & \cdots & a_{I,J}b_{I,J} \end{bmatrix}.$$
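The three products can be checked on small dense matrices; the NumPy snippet below only illustrates the definitions above (the paper never forms these products explicitly for sparse data).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))                 # I1 x J1 (here J1 = J = 3)
B = rng.random((5, 3))                 # I2 x J2

kron = np.kron(A, B)                   # Kronecker product, (I1*I2) x (J1*J2)

# Khatri-Rao product: column-wise Kronecker product, size (I1*I2) x J
khatri_rao = np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(3)])

# Hadamard product: element-wise product of same-sized matrices
H1, H2 = rng.random((4, 3)), rng.random((4, 3))
hadamard = H1 * H2
```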
The CP decomposition of rank $R$ (or with $R$ components) of a given tensor $\mathcal{X}$ factorizes $\mathcal{X}$ into a sum of $R$ rank-one tensors. For $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, it yields
$$x_{i,j,k} \approx \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr} \quad\text{and}\quad \mathcal{X} \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,$$
for $\mathbf{a}_r \in \mathbb{R}^{I}$, $\mathbf{b}_r \in \mathbb{R}^{J}$, and $\mathbf{c}_r \in \mathbb{R}^{K}$, where $\circ$ is the outer product of the vectors. Here the matrices $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_R]$, $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_R]$, and $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_R]$ are called the factor matrices, or factors. For $N$-mode tensors, we use $\mathbf{U}_1, \ldots, \mathbf{U}_N$ to refer to the factor matrices.
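For intuition, the rank-$R$ model can be evaluated entry-wise on small dense data. The einsum call below is an illustrative check of the formula $x_{i,j,k} \approx \sum_r a_{ir} b_{jr} c_{kr}$, not code from the paper.

```python
import numpy as np

I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(1)
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

# sum of R rank-one tensors a_r (outer) b_r (outer) c_r
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)

# entry-wise, X_hat[i, j, k] equals sum_r a_{ir} b_{jr} c_{kr}
assert np.isclose(X_hat[1, 2, 3], np.sum(A[1, :] * B[2, :] * C[3, :]))
```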
We are now equipped with the notation to present the Alternating Least Squares (ALS) method for obtaining a rank-$R$ approximation of a tensor $\mathcal{X}$ with the CP decomposition. A common formulation of CP-ALS is shown in Algorithm 1 for third-order tensors. At each iteration, each factor matrix is recomputed while fixing the other two, e.g., $\mathbf{A} \leftarrow \mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})(\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$. This operation is performed in the following order: $\mathbf{M}_{\mathbf{A}} = \mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})$, $\mathbf{V} = (\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$, and then $\mathbf{A} \leftarrow \mathbf{M}_{\mathbf{A}}\mathbf{V}$. Here $\mathbf{V}$ is a dense matrix of size $R \times R$ and is easy to compute. The important issue is the efficient computation of the MTTKRP operations yielding $\mathbf{M}_{\mathbf{A}}$, and similarly $\mathbf{M}_{\mathbf{B}} = \mathbf{X}_{(2)}(\mathbf{A} \odot \mathbf{C})$ and $\mathbf{M}_{\mathbf{C}} = \mathbf{X}_{(3)}(\mathbf{B} \odot \mathbf{A})$.
The sheer size of the Khatri-Rao products makes them impossible to compute explicitly; hence, efficient MTTKRP algorithms find other means to carry out the MTTKRP operation (see the next subsection).
Algorithm 1: CP-ALS for third-order tensors
  Input : $\mathcal{X}$: a third-order tensor; $R$: the rank of approximation
  Output: CP decomposition $[\![\lambda; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$
  repeat
      $\mathbf{A} \leftarrow \mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})(\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$
      Normalize columns of $\mathbf{A}$
      $\mathbf{B} \leftarrow \mathbf{X}_{(2)}(\mathbf{C} \odot \mathbf{A})(\mathbf{A}^T\mathbf{A} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$
      Normalize columns of $\mathbf{B}$
      $\mathbf{C} \leftarrow \mathbf{X}_{(3)}(\mathbf{B} \odot \mathbf{A})(\mathbf{A}^T\mathbf{A} \ast \mathbf{B}^T\mathbf{B})^{\dagger}$
      Normalize columns of $\mathbf{C}$ and store the norms as $\lambda$
  until no improvement or maximum iterations reached
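A minimal dense NumPy transcription of Algorithm 1 is given below for reference. It is illustrative only: the paper's setting is sparse and distributed, where the Khatri-Rao products are never formed explicitly and the MTTKRPs (the einsum lines below) are the operations to be parallelized.

```python
import numpy as np

def cp_als(X, R, iters=20, seed=0):
    """Dense sketch of Algorithm 1 for a 3rd-order numpy array X (illustrative only)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
    for _ in range(iters):
        # A <- X_(1)(C ⊙ B)(B^T B * C^T C)^†, with the MTTKRP M_A done via einsum
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        A /= np.linalg.norm(A, axis=0)
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        B /= np.linalg.norm(B, axis=0)
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        lam = np.linalg.norm(C, axis=0)          # store the norms as lambda
        C /= lam
    return lam, A, B, C
```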
2.2 Related work
SPLATT [31] is an efficient implementation of the MTTKRP operation for sparse tensors on shared memory systems. It is our understanding that the code is implemented for 3-mode tensors; there are no experimental results with higher order tensors. Their discussion includes the generalization of the techniques to higher order tensors whenever relevant. SPLATT implements the MTTKRP operation based on the slices of the dimension in which the factor is updated, e.g., on the mode-1 slices when computing $\mathbf{A} \leftarrow \mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})(\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$. Nonzeros of the fibers in a slice are multiplied with the corresponding rows of $\mathbf{B}$, and the results are accumulated to be later scaled with the corresponding row of $\mathbf{C}$ to compute the row of $\mathbf{A}$ corresponding to the slice. Parallelization is done using OpenMP directives, and load balance (in terms of the number of nonzeros in the slices of the mode for which the MTTKRP is computed) is achieved by using the dynamic scheduling policy. Hypergraph models are used to optimize cache performance by reducing the number of times a row of $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ is accessed. Smith et al. [31] also use $N$-partite graphs (where $N$ is the order of the tensor) to reorder the tensors for all dimensions. Experiments are conducted on an HP ProLiant BL280c G6 server with dual 8-core E5-2670 Xeon processors running at 2.6 GHz. Smith et al. implement a sparse tensor-vector product algorithm called TVec for carrying out the MTTKRP operation and report speedups with respect to this algorithm. They report a 3.7x speedup for the serial execution and a 29.8x speedup for the 16-way parallel execution of SPLATT with respect to TVec.
GigaTensor [23] is an implementation of CP-ALS that follows the MapReduce paradigm. All important steps (the MTTKRPs and the computations of $\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C}$) are performed using this paradigm. A distinct advantage of GigaTensor is that, thanks to MapReduce, the issues of fault tolerance, load balance, and out-of-core tensor data are automatically handled. On a real-world dataset, speedup studies with up to 100 machines (each machine has two quad-core Intel 2.83 GHz CPUs) are presented, where the speedup with 100 machines is 1.4 times the speedup with 25 machines. One iteration of CP-ALS as implemented in GigaTensor takes more than $10^3$ seconds for a random tensor of size $10^5 \times 10^5 \times 10^5$ with $10^5/50$ nonzeros on 35 machines, eventually reaching between $10^4$ and $10^5$ seconds for $10^9 \times 10^9 \times 10^9$ with $10^9/50$ nonzeros. The presentation [23] of GigaTensor focuses on three-mode tensors and expresses the map and the reduce functions for this case. To the best of our understanding, additional map and reduce functions are needed for higher order tensors, which would incur overheads.
DFacTo [14] is a distributed memory implementation of the MTTKRP operation. It performs two successive sparse matrix-vector multiplies (SpMVs) to compute a column of the product $\mathbf{X}_{(1)}(\mathbf{C} \odot \mathbf{B})$. A crucial observation (made also elsewhere [26]) is that this operation can be implemented as $\mathbf{X}_{(2)}^T\mathbf{B}(:,r)$, which can be reshaped into a matrix to be multiplied with $\mathbf{C}(:,r)$ to form the $r$th column of the MTTKRP. Although SpMV is a well-investigated operation, there is a peculiarity here: the result of the first SpMV forms the values of the sparse matrix used in the second one. Therefore, there are sophisticated data dependencies between the two SpMVs. Notice that DFacTo is rich in SpMV operations: there are two SpMVs per factorization rank per dimension of the input tensor. DFacTo needs to store the tensor matricized in all dimensions, i.e., $\mathbf{X}_{(1)}, \ldots, \mathbf{X}_{(N)}$. In low dimensions, this can be a slight memory overhead; yet in higher dimensions the overhead can be non-negligible. DFacTo uses MPI for parallelization, yet fully stores the factor matrices in all MPI ranks. The rows of $\mathbf{X}_{(1)}, \ldots, \mathbf{X}_{(N)}$ are distributed blockwise (statically). With this partition, each process computes the corresponding rows of the factor matrices. Finally, DFacTo performs an MPI_Allgatherv operation to communicate the new results to all processes, which results in $(I_n/P)\log_2 P$ communication volume per process (assuming a hypercube algorithm) when computing the $n$th factor matrix having $I_n$ rows using $P$ processes. Experiments are presented on machines equipped with two 2.1 GHz 12-core AMD 6172 processors, where up to 32 machines are used. In sequential runs, DFacTo is shown to be 5 times faster than GigaTensor and 10 times faster than a MATLAB implementation [5]. On a real-world dataset, DFacTo obtains about 3.5x speedup on 32 machines with respect to an execution on four machines.
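The two-SpMV observation can be verified on a small example. The sketch below uses a dense unfolding converted to scipy.sparse for clarity and checks the result against a direct MTTKRP; it mirrors the idea described above rather than DFacTo's actual implementation.

```python
import numpy as np
import scipy.sparse as sp

def mttkrp_two_spmv(X, B, C):
    """Column-by-column MTTKRP via two SpMV-like steps, in the spirit of the
    observation above (DFacTo itself operates on sparse matricized tensors)."""
    I, J, K = X.shape
    R = B.shape[1]
    # mode-2 unfolding: rows indexed by j, columns by the (i, k) pairs
    X2 = sp.csr_matrix(np.transpose(X, (1, 0, 2)).reshape(J, I * K))
    M_A = np.empty((I, R))
    for r in range(R):
        y = X2.T @ B[:, r]          # first SpMV: a vector of length I*K
        Y = y.reshape(I, K)         # reinterpret as an I x K matrix
        M_A[:, r] = Y @ C[:, r]     # multiply with C(:, r) -> rth column of M_A
    return M_A

# sanity check against a direct MTTKRP on a small dense tensor
rng = np.random.default_rng(2)
X, B, C = rng.random((3, 4, 5)), rng.random((4, 2)), rng.random((5, 2))
assert np.allclose(mttkrp_two_spmv(X, B, C), np.einsum('ijk,jr,kr->ir', X, B, C))
```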
Tensor Toolbox [6] is a MATLAB toolbox for handling tensors. It provides many essential operations and enables fast and efficient realizations of complex algorithms in MATLAB for sparse tensors [5]. Among those operations, MTTKRP implementations are provided and used in the CP-ALS method. Here, each column of the output is computed by performing $N-1$ sparse tensor-vector multiplications. Another well-known MATLAB toolbox is the N-way toolbox [2], which is essentially for dense tensors and now incorporates support for sparse tensors [1] through Tensor Toolbox. Tensor Toolbox and the related software provide excellent means for rapid prototyping of algorithms and also efficient programs for tensor operations that can be handled within MATLAB.
2.3 Hypergraphs and hypergraph partitioning
A hypergraph $H = (V, E)$ is defined as a set of vertices $V$ and a set of hyperedges $E$. Each hyperedge is a set of vertices. The vertices of a hypergraph can be associated with weights, denoted by $w[\cdot]$, and the hyperedges can be associated with costs, denoted by $c[\cdot]$. For a given integer $K \ge 2$, a $K$-way vertex partition of a hypergraph $H = (V, E)$ is denoted as $\Pi = \{V_1, \ldots, V_K\}$, where the parts are non-empty; mutually exclusive, $V_k \cap V_\ell = \emptyset$ for $k \ne \ell$; and collectively exhaustive, $V = \bigcup_k V_k$.
Let $W_k = \sum_{v \in V_k} w[v]$ be the total weight in $V_k$ and $W_{avg} = \sum_{v \in V} w[v]/K$ be the average part weight. If each part $V_k \in \Pi$ satisfies the balance criterion
$$W_k \le W_{avg}(1 + \varepsilon), \quad \text{for } k = 1, 2, \ldots, K, \qquad (1)$$
we say that $\Pi$ is balanced, where $\varepsilon$ represents the maximum allowed imbalance ratio.
In a partition $\Pi$, a hyperedge that has at least one vertex in a part is said to connect that part. The number of parts connected by a hyperedge $h$, i.e., its connectivity, is denoted as $\lambda_h$. Given a vertex partition $\Pi$ of a hypergraph $H = (V, E)$, one can measure the size of the cut induced by $\Pi$ as
$$\chi(\Pi) = \sum_{h \in E} c[h](\lambda_h - 1). \qquad (2)$$
This cut measure is called the connectivity-1 cutsize metric. Given $\varepsilon > 0$ and an integer $K > 1$, the standard hypergraph partitioning problem is defined as the task of finding a balanced partition $\Pi$ with $K$ parts such that $\chi(\Pi)$ is minimized. The hypergraph partitioning problem is NP-hard [27].
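The connectivity-1 metric of Equation (2) is straightforward to compute for a given partition. The short Python function below, with a hypothetical dictionary-based hypergraph representation of our own choosing, makes the definition concrete.

```python
def connectivity_cutsize(hyperedges, part, cost=None):
    """Connectivity-1 cutsize chi(Pi) = sum_h c[h] * (lambda_h - 1), where
    lambda_h is the number of parts connected by hyperedge h.
    hyperedges : dict mapping a hyperedge id to an iterable of vertices
    part       : dict mapping a vertex to its part id
    cost       : optional dict mapping a hyperedge id to c[h] (default 1)
    """
    cut = 0
    for h, verts in hyperedges.items():
        lam = len({part[v] for v in verts})            # connectivity lambda_h
        cut += (1 if cost is None else cost[h]) * (lam - 1)
    return cut

# hyperedge 'n2' spans two parts and contributes 1; 'n1' is internal to part 0
print(connectivity_cutsize({'n1': [0, 1], 'n2': [1, 2, 3]},
                           {0: 0, 1: 0, 2: 1, 3: 1}))   # -> 1
```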
A recent variant of the above problem is multi-constraint hypergraph partitioning [13, 24]. In this variant, each vertex has an associated vector of weights. The partitioning objective is the same as above, and the partitioning constraint is to satisfy a balancing constraint for each weight. Let $w[v, i]$ denote the $C$ weights of a vertex $v$ for $i = 1, \ldots, C$. In this variant, the balance criterion (1) is rewritten as
$$W_{k,i} \le W_{avg,i}(1 + \varepsilon) \quad \text{for } k = 1, \ldots, K \text{ and } i = 1, \ldots, C, \qquad (3)$$
where the $i$th weight $W_{k,i}$ of a part $V_k$ is defined as the sum of the $i$th weights of the vertices in that part (i.e., $W_{k,i} = \sum_{v \in V_k} w[v, i]$), $W_{avg,i}$ is the average part weight for the $i$th weight of all vertices (i.e., $W_{avg,i} = \sum_{v \in V} w[v, i]/K$), and $\varepsilon$ again represents the allowed imbalance ratio.
3. PARALLELIZATION
A common approach in implementing the MTTKRP is to explicitly matricize a tensor across all modes, and then perform the Khatri-Rao product using the matricized tensors [14, 31]. Matricizing a tensor in a mode $i$ requires column index values up to $\prod_{k \ne i} I_k$, which can exceed the integer limits supported by modern architectures when using tensors of higher order and very large dimensions. Also, matricizing across all modes results in $N$ replications of a tensor, which can exceed the memory limitations. Hence, in order to be able to handle large tensors, we store them in coordinate format for the MTTKRP operation, which is also the method of choice in Tensor Toolbox [6].
With a tensor stored in the coordinate format, the MTTKRP operation can be performed as shown in Algorithm 2. As seen on Line 1 of this algorithm, a row of $\mathbf{B}$ and a row of $\mathbf{C}$ are retrieved, and their Hadamard product is computed and scaled with a tensor entry to update a row of $\mathbf{M}_{\mathbf{A}}$. In general, for an $N$-mode tensor,
$$\mathbf{M}_{\mathbf{U}_1}(i_1,:) \leftarrow \mathbf{M}_{\mathbf{U}_1}(i_1,:) + x_{i_1,i_2,\ldots,i_N}\big[\mathbf{U}_2(i_2,:) \ast \cdots \ast \mathbf{U}_N(i_N,:)\big]$$
is computed. Here, the indices of the corresponding rows of the factor matrices and of $\mathbf{M}_{\mathbf{U}_1}$ coincide with the indices of the unique tensor entry of the operation.
Algorithm 2: MTTKRP for third-order tensors
  Input : $\mathcal{X}$: tensor; $\mathbf{B}, \mathbf{C}$: factor matrices in all modes except the first; $I_A$: number of rows of the factor $\mathbf{A}$; $R$: rank of the factors
  Output: $\mathbf{M}_{\mathbf{A}} = \mathbf{X}_{(1)}(\mathbf{B} \odot \mathbf{C})$
  Initialize $\mathbf{M}_{\mathbf{A}}$ to zeros of size $I_A \times R$
  foreach $x_{i,j,k} \in \mathcal{X}$ do
1:    $\mathbf{M}_{\mathbf{A}}(i,:) \leftarrow \mathbf{M}_{\mathbf{A}}(i,:) + x_{i,j,k}\,[\mathbf{B}(j,:) \ast \mathbf{C}(k,:)]$
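A serial NumPy sketch of Algorithm 2 on a coordinate-format tensor is shown below; the scatter-add mirrors Line 1 of the algorithm. This is our own illustrative transcription, not the HyperTensor implementation.

```python
import numpy as np

def mttkrp_coo(coords, vals, B, C, num_rows):
    """Serial MTTKRP for a 3rd-order tensor in coordinate format (Algorithm 2):
    for every nonzero x_{i,j,k}, add x_{i,j,k} * [B(j,:) * C(k,:)] to M_A(i,:).
    coords : (nnz, 3) array of 0-based (i, j, k) indices
    vals   : (nnz,) array of nonzero values
    """
    coords = np.asarray(coords)
    vals = np.asarray(vals, dtype=float)
    i, j, k = coords[:, 0], coords[:, 1], coords[:, 2]
    contrib = vals[:, None] * (B[j, :] * C[k, :])   # scaled Hadamard products of rows
    M_A = np.zeros((num_rows, B.shape[1]))
    np.add.at(M_A, i, contrib)                      # scatter-add into the rows of M_A
    return M_A
```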
As the factor matrices are accessed row-wise, we define computational units in terms of the rows of factor matrices. It follows naturally to partition all factor matrices row-wise and to use the same partition for the MTTKRP operation in each mode of an input tensor across all CP-ALS iterations, to prevent extra communication. A crucial issue is the task definition, as this pertains to the issues of load balancing and communication. We identify a coarse-grain and a fine-grain task definition for this computational kernel.
In the coarse-grain task definition, the $i$th atomic task consists of computing the row $\mathbf{M}_{\mathbf{A}}(i,:)$ using the nonzeros in the tensor slice $\mathcal{X}_{i,:,:}$ and the rows of $\mathbf{B}$ and $\mathbf{C}$ corresponding to the nonzeros in that slice. The input tensor $\mathcal{X}$ does not change throughout the iterations of the tensor decomposition algorithms; hence it is viable to make the whole slice $\mathcal{X}_{i,:,:}$ available to the process holding $\mathbf{M}_{\mathbf{A}}(i,:)$, so that the MTTKRP operation can be performed by only communicating the rows of $\mathbf{B}$ and $\mathbf{C}$. Yet, as CP-ALS requires the MTTKRP in all modes, and each nonzero $x_{i,j,k}$ belongs to the slices $\mathcal{X}_{i,:,:}$, $\mathcal{X}_{:,j,:}$, and $\mathcal{X}_{:,:,k}$, we need to replicate tensor entries in the owner processes of these slices. This may require up to $N$ times replication of the tensor, depending on its partitioning. Note that an explicit matricization always requires exactly $N$ replications of the tensor entries.
In the fine-grain task definition, an atomic task corresponds to the multiplication of a tensor entry with the Hadamard product of the corresponding rows of $\mathbf{B}$ and $\mathbf{C}$. Here, tensor nonzeros are partitioned among processes with no replication, to induce a task partition by following the owner-computes rule. This necessitates communicating the rows of $\mathbf{B}$ and $\mathbf{C}$ that are needed by these atomic tasks. Furthermore, partial results on the rows of $\mathbf{M}_{\mathbf{A}}$ need to be communicated, as without duplicating tensor entries, we cannot in general compute all contributions to a row of $\mathbf{M}_{\mathbf{A}}$. Here, the partition of $\mathcal{X}$ should be useful in all modes, as the CP-ALS method requires the MTTKRP in all modes.
The coarse-grain task definition resembles the one-dimensional (1D) row-wise (or column-wise) partitioning of sparse matrices, whereas the fine-grain one resembles the two-dimensional (nonzero-based) partitioning of sparse matrices for parallel sparse matrix-vector multiply (SpMV) operations. As is confirmed for SpMV in modern applications, 1D partitioning usually leads to harder problems of load balancing and communication cost reduction. The same phenomenon is likely to be observed in tensors as well. Nonetheless, we cover the coarse-grain task definition, as it is used in the state-of-the-art parallel MTTKRP methods [14, 31], which partition the matricized tensor row-wise (or, equivalently, partition the input tensor by slices).
3.1 Coarse-grain task model
In the coarse-grain task model, computing the rows of $\mathbf{M}_{\mathbf{A}}$, $\mathbf{M}_{\mathbf{B}}$, and $\mathbf{M}_{\mathbf{C}}$ are defined as the atomic tasks, which are partitioned across all processes. Let $\mu_A$ denote the partition of the first mode's indices among the processes, i.e., $\mu_A(i) = p$ if the process $p$ is responsible for computing $\mathbf{M}_{\mathbf{A}}(i,:)$. Similarly, let $\mu_B$ and $\mu_C$ define the partition of the second and the third mode indices. The process owning $\mathbf{M}_{\mathbf{A}}(i,:)$ needs the entire tensor slice $\mathcal{X}_{i,:,:}$; similarly, the process owning $\mathbf{M}_{\mathbf{B}}(j,:)$ needs $\mathcal{X}_{:,j,:}$, and the owner of $\mathbf{M}_{\mathbf{C}}(k,:)$ needs $\mathcal{X}_{:,:,k}$. This necessitates the duplication of some tensor nonzeros to prevent unnecessary communication.
One needs to take the context of CP-ALS into account when parallelizing the MTTKRP method. First, the output $\mathbf{M}_{\mathbf{A}}$ of the MTTKRP is transformed into $\mathbf{A}$. Since $\mathbf{A}(i,:)$ is computed simply by multiplying $\mathbf{M}_{\mathbf{A}}(i,:)$ with the matrix $(\mathbf{B}^T\mathbf{B} \ast \mathbf{C}^T\mathbf{C})^{\dagger}$, we make the process which owns $\mathbf{M}_{\mathbf{A}}(i,:)$ responsible for computing $\mathbf{A}(i,:)$. Second, $N$ MTTKRP operations follow one another in an iteration. Assuming that every process has the required rows of the factor matrices while executing the MTTKRP for the first mode, it is advisable to implement the MTTKRP in such a way that its output $\mathbf{M}_{\mathbf{A}}$, after being transformed into $\mathbf{A}$, is communicated. This way, all processes would have the necessary data for executing the MTTKRP for the next mode. With these in mind, the coarse-grain parallel MTTKRP method executes Algorithm 3 at each process $p$.
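As a simple baseline for the coarse-grain model, one can assign mode-1 slices to processes by greedily balancing the nonzero counts $|\mathcal{X}_{i,:,:}|$. The sketch below is our own illustration of that baseline only; it addresses the load-balance criterion, whereas the hypergraph-based methods of this section also reduce the communication volume.

```python
import heapq
from collections import Counter

def assign_slices_by_nnz(mode1_indices, P):
    """Greedily assign mode-1 slices to P processes so that per-process nonzero
    counts |X_{i,:,:}| are balanced (a baseline, not the paper's hypergraph method).
    mode1_indices : iterable with the mode-1 index of every nonzero
    Returns a dict: slice index -> owning process.
    """
    nnz_per_slice = Counter(mode1_indices)
    heap = [(0, p) for p in range(P)]          # (current load, process id)
    heapq.heapify(heap)
    owner = {}
    # heaviest slices first, each to the currently least-loaded process
    for i, nnz in sorted(nnz_per_slice.items(), key=lambda item: -item[1]):
        load, p = heapq.heappop(heap)
        owner[i] = p
        heapq.heappush(heap, (load + nnz, p))
    return owner
```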

References
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., San Francisco, 1979.
T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
J. D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.
R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.