Speeding Up Distributed Machine Learning Using Codes
TLDR
In this paper, the authors provide theoretical insights on how coded solutions can achieve significant gains compared with uncoded ones for matrix multiplication and data shuffling in large-scale distributed systems.
Abstract
Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems, there are several types of noise that can affect the performance of distributed machine learning algorithms—straggler nodes, system failures, or communication bottlenecks—but there has been little interaction cutting across codes, machine learning, and distributed systems. In this paper, we provide theoretical insights on how coded solutions can achieve significant gains compared with uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers and show that if the number of homogeneous workers is $n$, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of $\log n$. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction $\alpha$ of the data matrix can be cached at each worker, and $n$ is the number of workers, coded shuffling reduces the communication cost by a factor of $\left(\alpha + \frac{1}{n}\right)\gamma(n)$ compared with uncoded shuffling, where $\gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to that of multicasting a common message (of the same size) to $n$ users. For instance, $\gamma(n) \simeq n$ if multicasting a message to $n$ users is as cheap as unicasting a message to one user. We also provide experimental results corroborating the theoretical gains of our coded algorithms.
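The straggler-mitigation idea for matrix multiplication can be sketched concretely: split $A$ into $k$ row-blocks, encode them into $n$ coded blocks with an MDS-style code, and decode $Ax$ from whichever $k$ workers finish first. The sketch below is a minimal, hypothetical illustration (function names are mine; a real-valued Vandermonde generator stands in for an MDS code, and stragglers are simulated by sampling which $k$ of $n$ workers respond):

```python
import numpy as np

def mds_coded_matmul(A, x, n, k, seed=0):
    """Recover A @ x from the results of any k out of n coded workers.

    A is split into k row-blocks; an n x k Vandermonde generator combines
    them into n coded blocks (an MDS-style code over the reals, so any k
    coded results suffice to decode).
    """
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    assert m % k == 0, "row count must divide evenly into k blocks"
    blocks = np.split(A, k)                                   # k blocks, each (m/k, d)
    G = np.vander(np.arange(1, n + 1.0), k, increasing=True)  # n x k generator
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

    # Worker i would compute coded[i] @ x; simulate stragglers by letting
    # only a random subset of k workers finish first.
    done = sorted(rng.choice(n, size=k, replace=False))
    partial = np.stack([coded[i] @ x for i in done])          # (k, m/k)

    # Decode: G[done] @ [blocks[j] @ x] = partial; solve for the block products.
    decoded = np.linalg.solve(G[done], partial)               # (k, m/k)
    return decoded.ravel()
```

Because the generator matrix has full rank on every subset of $k$ rows, the master never waits for the slowest $n - k$ workers, which is the source of the $\log n$ speedup under exponential-tailed runtimes.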
Citations
Journal Article
Machine Learning at the Wireless Edge: Distributed Stochastic Gradient Descent Over-the-Air
TL;DR: This work introduces a novel analog scheme, called A-DSGD, which exploits the additive nature of the wireless MAC for over-the-air gradient computation, and provides convergence analysis for this approach.
Proceedings Article
Gradient Coding: Avoiding Stragglers in Distributed Learning
TL;DR: This work proposes a novel coding-theoretic framework for mitigating stragglers in distributed learning and shows how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent.
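The replicate-and-code idea can be seen in the smallest case: $n = 3$ workers tolerating $s = 1$ straggler, where each worker stores two of three data partitions and sends a single coded combination of its two partial gradients, and the master recovers the full gradient sum from any two responses. A hedged sketch (the coefficients follow the small worked example of this style of scheme; function names are illustrative):

```python
import numpy as np

def coded_sends(g1, g2, g3):
    """Each worker holds 2 of 3 partitions and sends one coded vector."""
    return {
        1: 0.5 * g1 + g2,   # worker 1 stores partitions {1, 2}
        2: g2 - g3,         # worker 2 stores partitions {2, 3}
        3: 0.5 * g1 + g3,   # worker 3 stores partitions {1, 3}
    }

# Decoding coefficients: for any surviving pair (i, j),
# a * send_i + b * send_j equals the full gradient g1 + g2 + g3.
DECODE = {(1, 2): (2.0, -1.0), (1, 3): (1.0, 1.0), (2, 3): (1.0, 2.0)}

def recover(sends, i, j):
    a, b = DECODE[(i, j)]
    return a * sends[i] + b * sends[j]
```

Whichever single worker straggles, the remaining pair's coded vectors span the full gradient, so the synchronous update never stalls.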
Journal Article
A Fundamental Tradeoff Between Computation and Communication in Distributed Computing
TL;DR: A coded scheme, named “coded distributed computing” (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of r can create novel coding opportunities that reduce the communication load by the same factor.
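The coding opportunity that redundant Map computations create can be illustrated in the smallest case, $K = 3$ nodes with computation load $r = 2$: each file is mapped at two nodes, so one XOR multicast can serve two nodes' missing intermediate values at once. This is a simplified toy exchange, not the full CDC scheme (which splits values into segments to achieve the factor-$r$ reduction exactly), and all variable names are illustrative:

```python
def xor(a, b):
    return bytes(p ^ q for p, q in zip(a, b))

# Map placement (r = 2): node 1 maps files {1, 2}, node 2 maps {2, 3},
# node 3 maps {1, 3}. In the shuffle, node 1 needs v13 (its intermediate
# value from file 3), node 2 needs v21 (file 1), node 3 needs v32 (file 2).
v13, v21, v32 = b"\x0a", b"\x0b", b"\x0c"   # toy intermediate values

# Node 3 mapped files 1 and 3, so it already computed both v13 and v21:
# one multicast of their XOR serves nodes 1 and 2 simultaneously.
m = xor(v13, v21)
node1_gets = xor(m, v21)   # node 1 cancels v21, which it computed locally
node2_gets = xor(m, v13)   # node 2 cancels v13, which it computed locally
assert node1_gets == v13 and node2_gets == v21
# v32 is still sent uncoded (from node 1 or 2), so 2 transmissions replace 3.
```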
Posted Content
A Fundamental Tradeoff between Computation and Communication in Distributed Computing
TL;DR: In this article, a coded distributed computing (CDC) scheme is proposed to reduce the communication load in distributed computing, where the overall computation is decomposed into computing a set of Map and Reduce functions distributedly across multiple computing nodes.
Journal Article
Joint Device Scheduling and Resource Allocation for Latency Constrained Wireless Federated Learning
TL;DR: In this paper, a joint device scheduling and resource allocation policy is proposed to maximize the model accuracy within a given total training time budget for latency-constrained wireless FL, where a lower bound on the reciprocal of the training performance loss is derived.
References
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Book
Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers
TL;DR: It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
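The distributed-fit pattern ADMM enables can be sketched for a least-squares problem split across workers: each worker repeatedly solves a small local problem and the workers are driven to agreement through an averaged consensus variable. A minimal consensus-ADMM sketch, assuming local objectives $\tfrac{1}{2}\|A_i x - b_i\|^2$ (names and the fixed penalty $\rho$ are illustrative choices):

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, iters=500):
    """Solve min_x sum_i (1/2)||A_i x - b_i||^2 by consensus ADMM.

    Each worker i keeps a local copy x_i and dual u_i; the global
    variable z is the average that all copies are pulled toward.
    """
    d = As[0].shape[1]
    z = np.zeros(d)
    us = [np.zeros(d) for _ in As]
    # Precompute each worker's local solve factor (A_i^T A_i + rho I)^-1.
    Ms = [np.linalg.inv(A.T @ A + rho * np.eye(d)) for A in As]
    qs = [A.T @ b for A, b in zip(As, bs)]
    for _ in range(iters):
        xs = [M @ (q + rho * (z - u)) for M, q, u in zip(Ms, qs, us)]  # local steps
        z = np.mean([x + u for x, u in zip(xs, us)], axis=0)           # consensus
        us = [u + x - z for u, x in zip(us, xs)]                       # dual ascent
    return z
```

The only coordination per iteration is the averaging step, which is why the method maps naturally onto master-worker or MapReduce-style systems.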
Posted Content
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
TL;DR: Batch Normalization normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, achieving state-of-the-art performance on ImageNet.
Proceedings Article
Spark: cluster computing with working sets
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings Article
Large Scale Distributed Deep Networks
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng
TL;DR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.