Speeding Up Distributed Machine Learning Using Codes
TLDR
In this paper, the authors provide theoretical insights on how coded solutions can achieve significant gains compared with uncoded ones for matrix multiplication and data shuffling in large-scale distributed systems.
Abstract
Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems, there are several types of noise that can affect the performance of distributed machine learning algorithms—straggler nodes, system failures, or communication bottlenecks—but there has been little interaction cutting across codes, machine learning, and distributed systems. In this paper, we provide theoretical insights on how coded solutions can achieve significant gains compared with uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers and show that if the number of homogeneous workers is $n$, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of $\log n$. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction $\alpha$ of the data matrix can be cached at each worker, and $n$ is the number of workers, coded shuffling reduces the communication cost by a factor of $\left(\alpha + \frac{1}{n}\right)\gamma(n)$ compared with uncoded shuffling, where $\gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to that of multicasting a common message (of the same size) to $n$ users. For instance, $\gamma(n) \simeq n$ if multicasting a message to $n$ users is as cheap as unicasting a message to one user. We also provide experimental results corroborating the theoretical gains of our coded algorithms.
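The straggler-mitigation idea for matrix multiplication can be sketched concretely: split $A$ into $k$ row-blocks, encode them into $n$ coded blocks with an MDS-style code, and decode $Ax$ from whichever $k$ workers finish first. The sketch below is a minimal, hypothetical illustration (function names are mine; a real-valued Vandermonde generator stands in for an MDS code, and stragglers are simulated by sampling which $k$ of $n$ workers respond):

```python
import numpy as np

def mds_coded_matmul(A, x, n, k, seed=0):
    """Recover A @ x from the results of any k out of n coded workers.

    A is split into k row-blocks; an n x k Vandermonde generator combines
    them into n coded blocks (an MDS-style code over the reals, so any k
    coded results suffice to decode).
    """
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    assert m % k == 0, "row count must divide evenly into k blocks"
    blocks = np.split(A, k)                                   # k blocks, each (m/k, d)
    G = np.vander(np.arange(1, n + 1.0), k, increasing=True)  # n x k generator
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

    # Worker i would compute coded[i] @ x; simulate stragglers by letting
    # only a random subset of k workers finish first.
    done = sorted(rng.choice(n, size=k, replace=False))
    partial = np.stack([coded[i] @ x for i in done])          # (k, m/k)

    # Decode: G[done] @ [blocks[j] @ x] = partial; solve for the block products.
    decoded = np.linalg.solve(G[done], partial)               # (k, m/k)
    return decoded.ravel()
```

Because the generator matrix has full rank on every subset of $k$ rows, the master never waits for the slowest $n - k$ workers, which is the source of the $\log n$ speedup under exponential-tailed runtimes.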
Citations
Journal Article
Machine Learning at the Wireless Edge: Distributed Stochastic Gradient Descent Over-the-Air
TL;DR: This work introduces a novel analog scheme, called A-DSGD, which exploits the additive nature of the wireless MAC for over-the-air gradient computation, and provides convergence analysis for this approach.
Proceedings Article
Gradient Coding: Avoiding Stragglers in Distributed Learning
TL;DR: This work proposes a novel coding-theoretic framework for mitigating stragglers in distributed learning and shows how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent.
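The replicate-and-code idea can be seen in the smallest case: $n = 3$ workers tolerating $s = 1$ straggler, where each worker stores two of three data partitions and sends a single coded combination of its two partial gradients, and the master recovers the full gradient sum from any two responses. A hedged sketch (the coefficients follow the small worked example of this style of scheme; function names are illustrative):

```python
import numpy as np

def coded_sends(g1, g2, g3):
    """Each worker holds 2 of 3 partitions and sends one coded vector."""
    return {
        1: 0.5 * g1 + g2,   # worker 1 stores partitions {1, 2}
        2: g2 - g3,         # worker 2 stores partitions {2, 3}
        3: 0.5 * g1 + g3,   # worker 3 stores partitions {1, 3}
    }

# Decoding coefficients: for any surviving pair (i, j),
# a * send_i + b * send_j equals the full gradient g1 + g2 + g3.
DECODE = {(1, 2): (2.0, -1.0), (1, 3): (1.0, 1.0), (2, 3): (1.0, 2.0)}

def recover(sends, i, j):
    a, b = DECODE[(i, j)]
    return a * sends[i] + b * sends[j]
```

Whichever single worker straggles, the remaining pair's coded vectors span the full gradient, so the synchronous update never stalls.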
Journal Article
A Fundamental Tradeoff Between Computation and Communication in Distributed Computing
TL;DR: A coded scheme, named “coded distributed computing” (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of r can create novel coding opportunities that reduce the communication load by the same factor.
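The coding opportunity that redundant Map computations create can be illustrated in the smallest case, $K = 3$ nodes with computation load $r = 2$: each file is mapped at two nodes, so one XOR multicast can serve two nodes' missing intermediate values at once. This is a simplified toy exchange, not the full CDC scheme (which splits values into segments to achieve the factor-$r$ reduction exactly), and all variable names are illustrative:

```python
def xor(a, b):
    return bytes(p ^ q for p, q in zip(a, b))

# Map placement (r = 2): node 1 maps files {1, 2}, node 2 maps {2, 3},
# node 3 maps {1, 3}. In the shuffle, node 1 needs v13 (its intermediate
# value from file 3), node 2 needs v21 (file 1), node 3 needs v32 (file 2).
v13, v21, v32 = b"\x0a", b"\x0b", b"\x0c"   # toy intermediate values

# Node 3 mapped files 1 and 3, so it already computed both v13 and v21:
# one multicast of their XOR serves nodes 1 and 2 simultaneously.
m = xor(v13, v21)
node1_gets = xor(m, v21)   # node 1 cancels v21, which it computed locally
node2_gets = xor(m, v13)   # node 2 cancels v13, which it computed locally
assert node1_gets == v13 and node2_gets == v21
# v32 is still sent uncoded (from node 1 or 2), so 2 transmissions replace 3.
```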
Posted Content
A Fundamental Tradeoff between Computation and Communication in Distributed Computing
TL;DR: In this article, a coded distributed computing (CDC) scheme is proposed to reduce the communication load in distributed computing, where the overall computation is decomposed into computing a set of Map and Reduce functions distributedly across multiple computing nodes.
Journal Article
Joint Device Scheduling and Resource Allocation for Latency Constrained Wireless Federated Learning
TL;DR: In this paper, a joint device scheduling and resource allocation policy is proposed to maximize the model accuracy within a given total training time budget for latency-constrained wireless FL, where a lower bound on the reciprocal of the training performance loss is derived.
References
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Book
Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers
TL;DR: It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
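The distributed-fit pattern ADMM enables can be sketched for a least-squares problem split across workers: each worker repeatedly solves a small local problem and the workers are driven to agreement through an averaged consensus variable. A minimal consensus-ADMM sketch, assuming local objectives $\tfrac{1}{2}\|A_i x - b_i\|^2$ (names and the fixed penalty $\rho$ are illustrative choices):

```python
import numpy as np

def consensus_admm(As, bs, rho=1.0, iters=500):
    """Solve min_x sum_i (1/2)||A_i x - b_i||^2 by consensus ADMM.

    Each worker i keeps a local copy x_i and dual u_i; the global
    variable z is the average that all copies are pulled toward.
    """
    d = As[0].shape[1]
    z = np.zeros(d)
    us = [np.zeros(d) for _ in As]
    # Precompute each worker's local solve factor (A_i^T A_i + rho I)^-1.
    Ms = [np.linalg.inv(A.T @ A + rho * np.eye(d)) for A in As]
    qs = [A.T @ b for A, b in zip(As, bs)]
    for _ in range(iters):
        xs = [M @ (q + rho * (z - u)) for M, q, u in zip(Ms, qs, us)]  # local steps
        z = np.mean([x + u for x, u in zip(xs, us)], axis=0)           # consensus
        us = [u + x - z for u, x in zip(us, xs)]                       # dual ascent
    return z
```

The only coordination per iteration is the averaging step, which is why the method maps naturally onto master-worker or MapReduce-style systems.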
Posted Content
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
TL;DR: Batch Normalization normalizes layer inputs for each training mini-batch to reduce internal covariate shift in deep neural networks, achieving state-of-the-art performance on ImageNet.
Proceedings Article
Spark: cluster computing with working sets
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings Article
Large Scale Distributed Deep Networks
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng
TL;DR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.