Scalable sparse tensor decompositions in distributed memory systems
Citations
The Tensor Algebra Compiler
Parallel Tensor Compression for Large-Scale Scientific Data
Tensor-Matrix Products with a Compressed Sparse Tensor
HiCOO: Hierarchical Storage of Sparse Tensors
Accelerating the Tucker Decomposition with Compressed Sparse Tensors
References
Computers and Intractability: A Guide to the Theory of NP-Completeness
Tensor Decompositions and Applications
Analysis of Individual Differences in Multidimensional Scaling via an N-way Generalization of "Eckart-Young" Decomposition
Foundations of the PARAFAC Procedure: Models and Conditions for an "Explanatory" Multi-Modal Factor Analysis
Frequently Asked Questions (12)
Q2. What have the authors stated for future work in "Scalable sparse tensor decompositions in distributed memory systems"?
The authors state that they will investigate this in future work. They plan to update their codes and perform a comparison in the near future. They also note that the size of the hypergraphs they build can pose difficulties for all existing partitioning tools.
Q3. How many slices of the first mode should be allocated to the processes?
To achieve load balance, the slices of the first mode should be partitioned among the processes equitably, taking the number of nonzeros per slice into account, as in the sketch below.
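As an illustration only (not the authors' code), here is a minimal greedy sketch in Python, assuming a third-order tensor in COO form where `coords[:, 0]` holds the mode-1 indices; the names `coords` and `num_procs` are assumptions for the example:

```python
# A minimal sketch of nonzero-aware slice partitioning for load balance.
import numpy as np

def partition_mode1_slices(coords, num_slices, num_procs):
    """Assign contiguous blocks of mode-1 slices to processes so that
    each process receives roughly the same number of nonzeros."""
    # Count nonzeros per mode-1 slice.
    nnz_per_slice = np.bincount(coords[:, 0], minlength=num_slices)
    target = nnz_per_slice.sum() / num_procs  # ideal load per process

    owner = np.empty(num_slices, dtype=int)
    proc, load = 0, 0
    for i in range(num_slices):
        owner[i] = proc
        load += nnz_per_slice[i]
        # Advance to the next process once the current one is "full".
        if load >= target and proc < num_procs - 1:
            proc, load = proc + 1, 0
    return owner  # owner[i] = process holding slice i

# Example: 6 slices, nonzeros concentrated in slice 0.
coords = np.array([[0, 1, 2], [0, 2, 0], [0, 0, 1],
                   [2, 1, 1], [4, 0, 0], [5, 2, 2]])
print(partition_mode1_slices(coords, num_slices=6, num_procs=3))
```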
Q4. What is the definition of a fiber in a tensor?
A fiber in a tensor is defined by fixing every index but one, e.g., if X is a third-order tensor, X_{:,j,k} is a mode-1 fiber and X_{i,j,:} is a mode-3 fiber.
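For instance, using NumPy slicing to illustrate the definition (this example is not from the paper):

```python
# Fibers are obtained by fixing all but one index of a tensor.
import numpy as np

X = np.arange(24).reshape(2, 3, 4)  # a dense third-order tensor

mode1_fiber = X[:, 1, 2]  # fix j=1, k=2, vary i  ->  X_{:,j,k}
mode3_fiber = X[0, 1, :]  # fix i=0, j=1, vary k  ->  X_{i,j,:}

print(mode1_fiber.shape)  # (2,)  one entry per mode-1 index
print(mode3_fiber.shape)  # (4,)  one entry per mode-3 index
```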
Q5. What is the method used to partition the tensor?
The method ht-finegrain-random partitions the tensor nonzeros as well as the rows of the factor matrices randomly to establish load balance.
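A minimal sketch of this random fine-grain assignment, assuming a COO tensor; the names `coords`, `dims`, and `num_procs` are illustrative assumptions, not the authors' implementation:

```python
# Randomly assign each nonzero and each factor-matrix row to a process.
import numpy as np

rng = np.random.default_rng(seed=0)

def finegrain_random(coords, dims, num_procs):
    # Fine-grain tasks: every nonzero is owned by a random process.
    nnz_owner = rng.integers(num_procs, size=coords.shape[0])
    # Every row of each mode's factor matrix also gets a random owner,
    # balancing computation and factor-row ownership in expectation.
    row_owners = [rng.integers(num_procs, size=d) for d in dims]
    return nnz_owner, row_owners

coords = np.array([[0, 0, 0], [1, 2, 1], [0, 1, 1], [2, 2, 0]])
nnz_owner, row_owners = finegrain_random(coords, dims=(3, 3, 2), num_procs=2)
print(nnz_owner)      # process id per nonzero
print(row_owners[0])  # process id per row of the mode-1 factor
```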
Q6. What are the recent parallel algorithms?
Two very recent parallel algorithms, DFacTo [14] and SPLATT [31], use coarse-grain tensor partitions and hence have coarse-grain tasks.
Q7. What is the way to implement the MTTKRP method?
Assuming that every process has the required rows of the factor matrices while executing the MTTKRP for the first mode, it is advisable to implement the MTTKRP in such a way that its output M_A is communicated after being transformed into A.
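For concreteness, a minimal sketch of a local first-mode MTTKRP on a third-order COO tensor, computing M_A(i,:) += x_{ijk} * (B(j,:) * C(k,:)); the names `coords`, `vals`, `B`, and `C` are assumptions for the example, and the communication of M_A (and its transformation into A in the CP-ALS update) is omitted:

```python
# Local MTTKRP for the first mode of a third-order sparse (COO) tensor.
import numpy as np

def mttkrp_mode1(coords, vals, B, C, num_rows):
    R = B.shape[1]                  # rank of the decomposition
    MA = np.zeros((num_rows, R))
    for (i, j, k), x in zip(coords, vals):
        # Accumulate the elementwise product of the needed factor rows.
        MA[i, :] += x * (B[j, :] * C[k, :])
    return MA  # in CP-ALS, A is obtained from MA and then communicated

# Tiny example: 3 nonzeros, rank R = 2.
coords = np.array([[0, 0, 0], [1, 2, 1], [0, 1, 1]])
vals = np.array([1.0, 2.0, 3.0])
B, C = np.ones((3, 2)), np.ones((2, 2))
print(mttkrp_mode1(coords, vals, B, C, num_rows=2))
```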
Q8. What is the main obstacle for further scalability of the fastest proposed method?
In their analysis and experiments, the authors identified communication latency as the dominant obstacle to further scalability of the fastest proposed method.
Q9. How many machines are used in the speedup study?
On real-world data, speedup studies with up to 100 machines (each with two quad-core 2.83 GHz Intel CPUs) are presented, where the speedup with 100 machines is 1.4 times that with 25 machines.
Q10. How can the fine-grain MTTKRP achieve the performance?
The experiments showed that, given a good partitioning, the proposed fine-grain MTTKRP achieves the best performance among the alternatives, reaching speedups of up to 194x on 512 cores.
Q11. What is the speedup of the Netflix tensor?
The authors first observe in Figure 2a that, on the Netflix tensor, ht-finegrain-hp clearly outperforms all other methods by achieving a speedup of 194x with 512 cores over a sequential execution, whereas ht-coarsegrain-hp, ht-coarsegrain-block, DFacTo, and ht-finegrain-random could only achieve 69x, 63x, 49x, and 40x speedups, respectively.
Q12. How many iterations did you run on the dataset?
The authors let the CP-ALS implementations run for 20 iterations on each dataset with R = 10, and record the average time spent per iteration.