Open Access Proceedings Article

Polynomial codes: an optimal design for high-dimensional coded matrix multiplication

TL;DR
This work considers a large-scale matrix multiplication problem in which the computation is carried out on a distributed system with a master node and multiple worker nodes, each of which can store parts of the input matrices, and proposes a computation strategy that leverages ideas from coding theory to design the intermediate computations at the worker nodes so as to efficiently handle straggling workers.
Abstract
We consider a large-scale matrix multiplication problem where the computation is carried out on a distributed system with a master node and multiple worker nodes, each of which can store parts of the input matrices. We propose a computation strategy that leverages ideas from coding theory to design intermediate computations at the worker nodes, in order to optimally deal with straggling workers. The proposed strategy, named polynomial codes, achieves the optimum recovery threshold, defined as the minimum number of workers that the master needs to wait for in order to compute the output. This is the first code that achieves the optimal utilization of redundancy for tolerating stragglers or failures in distributed matrix multiplication. Furthermore, by leveraging the algebraic structure of polynomial codes, we can map the reconstruction of the final output to a polynomial interpolation problem, which can be solved efficiently. Polynomial codes provide an order-wise improvement over the state of the art in terms of recovery threshold, and are also optimal with respect to several other metrics, including computation latency and communication load. Moreover, we extend this code to distributed convolution and show its order-wise optimality.
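To make the construction concrete, here is a minimal Python/NumPy sketch of the encoding and decoding described above, working over the reals purely for illustration (the paper's construction operates over a large finite field; the matrix sizes, worker count, and evaluation points below are illustrative assumptions, not values from the paper):

import numpy as np

# Illustrative sketch of polynomial codes over the reals (the actual scheme
# works over a large finite field); sizes and points are arbitrary choices.
m, n, N = 2, 2, 6                      # block counts and number of workers
rng = np.random.default_rng(0)
A = rng.integers(0, 5, (4, 4)).astype(float)
B = rng.integers(0, 5, (4, 4)).astype(float)

A_blocks = np.split(A, m, axis=0)      # A_0, ..., A_{m-1} (row blocks)
B_blocks = np.split(B, n, axis=1)      # B_0, ..., B_{n-1} (column blocks)
x = np.arange(1, N + 1, dtype=float)   # distinct evaluation points

# Encoding: worker i stores sum_j A_j x_i^j and sum_k B_k x_i^(k*m).
A_coded = [sum(A_blocks[j] * xi**j for j in range(m)) for xi in x]
B_coded = [sum(B_blocks[k] * xi**(k * m) for k in range(n)) for xi in x]

# Suppose only workers {0, 2, 3, 5} return; any m*n = 4 results suffice.
done = [0, 2, 3, 5]
results = np.stack([A_coded[i] @ B_coded[i] for i in done])

# Decoding: every entry of a returned product is a degree-(m*n - 1)
# polynomial in x_i whose coefficient of x^(j + k*m) is the corresponding
# entry of A_j B_k, so a Vandermonde solve (interpolation) recovers C.
V = np.vander(x[done], m * n, increasing=True)
coeffs = np.linalg.solve(V, results.reshape(m * n, -1)).reshape(results.shape)
C = np.block([[coeffs[j + k * m] for k in range(n)] for j in range(m)])
assert np.allclose(C, A @ B)

Any m*n of the N workers suffice for the Vandermonde solve, which is exactly the optimum recovery threshold described above.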


Citations
Journal Article

A Fundamental Tradeoff Between Computation and Communication in Distributed Computing

TL;DR: A coded scheme, named “coded distributed computing” (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of r can create novel coding opportunities that reduce the communication load by the same factor.
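For context, the tradeoff characterized in that work is, for $K$ computing nodes and computation load $r$ (each Map function computed at $r$ nodes), an optimal communication load of $L^*(r) = \frac{1}{r}\left(1 - \frac{r}{K}\right)$, compared with $1 - \frac{r}{K}$ for uncoded schemes, which is the source of the factor-$r$ reduction.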
Posted Content

A Fundamental Tradeoff between Computation and Communication in Distributed Computing

TL;DR: In this article, a coded distributed computing (CDC) scheme is proposed to reduce the communication load in distributed computing, where the overall computation is decomposed into computing a set of Map and Reduce functions distributedly across multiple computing nodes.
Journal Article

On the Optimal Recovery Threshold of Coded Matrix Multiplication

TL;DR: Novel coded computation strategies for distributed matrix–matrix products are provided that outperform the recent “Polynomial code” constructions in recovery threshold, i.e., the required number of successful workers.
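For a sense of scale (stated here as background from this line of work, not from the abstract above): when each worker stores a $1/m$ fraction of each input matrix, MatDot-style constructions achieve a recovery threshold of $2m - 1$, whereas polynomial codes require $m^2$ in the same regime, with the improvement paid for by a higher per-worker communication load.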
Proceedings Article

Improving Distributed Gradient Descent Using Reed-Solomon Codes

TL;DR: In this article, the authors adopt the framework of Tandon et al. and present a deterministic scheme that, for a prescribed per-machine computational effort, recovers the gradient from the least number of machines $f$ theoretically permissible, via an $O(f^2)$ decoding algorithm.
Journal Article

Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding

TL;DR: While evaluating bilinear complexity is a well-known challenging problem, it is shown that the optimal recovery threshold for linear coding strategies can be approximated to within a factor of 2 of this fundamental quantity.
References
Journal Article

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
Proceedings Article

Spark: cluster computing with working sets

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings Article

Improving MapReduce performance in heterogeneous environments

TL;DR: A new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
Journal Article

The tail at scale

TL;DR: Software techniques that tolerate latency variability are vital to building responsive large-scale Web services.
Journal Article

Algorithm-Based Fault Tolerance for Matrix Operations

TL;DR: Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.
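For intuition, here is a minimal real-arithmetic Python/NumPy sketch of the row/column checksum idea behind such ABFT schemes (the sizes and the injected fault are illustrative assumptions; the original schemes also cover further operations such as LU decomposition):

import numpy as np

# Checksum-augmented matrix multiplication: the product of a column-checksum
# version of A and a row-checksum version of B carries checksums that locate
# and correct a single faulty entry of C = A @ B.
rng = np.random.default_rng(1)
A = rng.integers(0, 10, (3, 3)).astype(float)
B = rng.integers(0, 10, (3, 3)).astype(float)

A_c = np.vstack([A, A.sum(axis=0)])                  # add checksum row
B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])   # add checksum column
C = A_c @ B_r                                        # 4x4 checksummed product

C[1, 2] += 7.0   # inject a single computational fault

# The row and column whose checksums disagree intersect at the faulty entry,
# and the size of the disagreement is exactly the needed correction.
row_err = C[:3, :3].sum(axis=0) - C[3, :3]
col_err = C[:3, :3].sum(axis=1) - C[:3, 3]
i, j = np.argmax(np.abs(col_err)), np.argmax(np.abs(row_err))
C[i, j] -= col_err[i]
assert np.allclose(C[:3, :3], A @ B)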