Open Access · Posted Content

A Fundamental Tradeoff between Computation and Communication in Distributed Computing

TLDR
In this article, a coded distributed computing (CDC) scheme is proposed to reduce the communication load in distributed computing, where the overall computation is decomposed into computing a set of Map and Reduce functions distributedly across multiple computing nodes.
Abstract
How can we optimally trade extra computing power to reduce the communication load in distributed computing? We answer this question by characterizing a fundamental tradeoff between computation and communication in distributed computing, i.e., the two are inversely proportional to each other. More specifically, a general distributed computing framework, motivated by commonly used structures like MapReduce, is considered, where the overall computation is decomposed into computing a set of "Map" and "Reduce" functions distributedly across multiple computing nodes. A coded scheme, named "Coded Distributed Computing" (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of $r$ (i.e., evaluating each function at $r$ carefully chosen nodes) can create novel coding opportunities that reduce the communication load by the same factor. An information-theoretic lower bound on the communication load is also provided, which matches the communication load achieved by the CDC scheme. As a result, the optimal computation-communication tradeoff in distributed computing is exactly characterized. Finally, the coding techniques of CDC are applied to the Hadoop TeraSort benchmark to develop a novel CodedTeraSort algorithm, which is empirically demonstrated to speed up the overall job execution by $1.97\times$ - $3.39\times$, for typical settings of interest.
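To make the factor-of-$r$ gain concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation) of the paper's canonical example with K = 3 nodes, N = 6 input files, and computation load r = 2: every file is mapped at two nodes, and one coded XOR broadcast per node replaces two uncoded unicasts in the shuffle.

```python
# CDC toy example: K = 3 nodes, N = 6 files, computation load r = 2.
# Three coded broadcasts deliver the six intermediate values that would
# otherwise require six uncoded unicasts, i.e., the load drops by r = 2.
import hashlib

K, N = 3, 6
files = {n: f"file-{n}".encode() for n in range(1, N + 1)}

# File placement: each file is stored (and mapped) at exactly r = 2 nodes.
placement = {1: {1, 2, 3, 4}, 2: {1, 2, 5, 6}, 3: {3, 4, 5, 6}}

def map_fn(q: int, data: bytes) -> bytes:
    """Intermediate value v_{q,n}: the piece of file n needed by reduce task q."""
    return hashlib.sha256(bytes([q]) + data).digest()

# Map phase: node k computes v_{q,n} for every file it stores and every q.
v = {k: {(q, n): map_fn(q, files[n]) for n in placement[k] for q in range(1, K + 1)}
     for k in range(1, K + 1)}

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Shuffle phase: each node broadcasts one XOR of two intermediate values, each
# wanted by one of the other two nodes and already computable by the third.
broadcasts = {
    1: xor(v[1][(2, 3)], v[1][(3, 1)]),  # node 2 wants v_{2,3}, node 3 wants v_{3,1}
    2: xor(v[2][(1, 5)], v[2][(3, 2)]),  # node 1 wants v_{1,5}, node 3 wants v_{3,2}
    3: xor(v[3][(1, 6)], v[3][(2, 4)]),  # node 1 wants v_{1,6}, node 2 wants v_{2,4}
}

# Decoding at node 1: cancel the interfering value it already computed locally.
assert xor(broadcasts[2], v[1][(3, 2)]) == map_fn(1, files[5])  # recovers v_{1,5}
assert xor(broadcasts[3], v[1][(2, 4)]) == map_fn(1, files[6])  # recovers v_{1,6}
print("3 coded broadcasts replace 6 uncoded unicasts: load reduced by r = 2")
```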


Citations
Proceedings Article

Gradient Coding: Avoiding Stragglers in Distributed Learning

TL;DR: This work proposes a novel coding theoretic framework for mitigating stragglers in distributed learning and shows how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent.
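The straggler-tolerance mechanism can be seen in the 3-worker, 1-straggler example from the gradient coding paper; the NumPy sketch below is illustrative only and shows how replicating each data partition on two workers and sending coded combinations lets the master recover the full gradient from any two workers.

```python
# Gradient coding toy example: 3 workers, each holding 2 of 3 data partitions,
# each sending one coded combination; any 2 messages recover the full gradient.
import numpy as np

g1, g2, g3 = (np.random.randn(4) for _ in range(3))  # partial gradients of 3 partitions
full = g1 + g2 + g3

w1 = 0.5 * g1 + g2   # worker 1 holds partitions {1, 2}
w2 = g2 - g3         # worker 2 holds partitions {2, 3}
w3 = 0.5 * g1 + g3   # worker 3 holds partitions {1, 3}

# Any 2 of the 3 coded messages suffice (the third worker may straggle):
assert np.allclose(2 * w1 - w2, full)   # workers {1, 2}
assert np.allclose(w1 + w3, full)       # workers {1, 3}
assert np.allclose(w2 + 2 * w3, full)   # workers {2, 3}
```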
Journal ArticleDOI

The Exact Rate-Memory Tradeoff for Caching With Uncoded Prefetching

TL;DR: A novel caching scheme is proposed, which strictly improves the state of the art by exploiting commonality among user demands, and the rate-memory tradeoff is fully characterized for a decentralized setting in which users fill their cache content without any coordination.
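The gain from coded delivery over uncoded prefetching is easiest to see in the classic two-user, two-file, cache-size-one example; the sketch below uses that toy case only and is not the paper's general scheme.

```python
# Coded caching toy example with uncoded prefetching: one XOR broadcast serves
# both users' demands, halving the delivery rate compared to uncoded delivery.
A1, A2 = b"A-half-1", b"A-half-2"   # file A split into two halves
B1, B2 = b"B-half-1", b"B-half-2"   # file B split into two halves

cache_user1 = {"A1": A1, "B1": B1}  # uncoded prefetching: user 1 caches first halves
cache_user2 = {"A2": A2, "B2": B2}  # user 2 caches second halves

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Demands: user 1 wants file A, user 2 wants file B.
broadcast = xor(A2, B1)             # a single coded transmission

assert xor(broadcast, cache_user1["B1"]) == A2  # user 1 recovers its missing half
assert xor(broadcast, cache_user2["A2"]) == B1  # user 2 recovers its missing half
```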
Journal ArticleDOI

Speeding Up Distributed Machine Learning Using Codes

TL;DR: In this paper, the authors provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones for matrix multiplication and data shuffling in large-scale distributed systems.
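As an illustration of the matrix-multiplication part, the sketch below shows a (3, 2) MDS-coded matrix-vector multiply in the spirit of the paper (an assumed toy setup, not its exact construction): one redundant worker makes the result recoverable from any two of three workers.

```python
# MDS-coded matrix-vector multiplication: split A into two row blocks, add one
# "parity" worker computing (A1 + A2) x; any 2 of 3 worker results recover A x.
import numpy as np

A = np.random.randn(6, 4)
x = np.random.randn(4)
A1, A2 = A[:3], A[3:]                 # row blocks for the two systematic workers

y1, y2, y3 = A1 @ x, A2 @ x, (A1 + A2) @ x   # the three workers' tasks

# Suppose worker 2 straggles: recover its block from the coded result.
A2x_recovered = y3 - y1
assert np.allclose(np.concatenate([y1, A2x_recovered]), A @ x)
```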
Journal ArticleDOI

The Role of Caching in Future Communication Systems and Networks

TL;DR: Caching has been studied for more than 40 years and has recently received increased attention from industry and academia; this survey aims to convince the reader that content caching is an exciting research topic for future communication systems and networks.
Journal ArticleDOI

On the Optimal Recovery Threshold of Coded Matrix Multiplication

TL;DR: Novel coded computation strategies for distributed matrix-matrix products are provided that outperform the recent "Polynomial code" constructions in recovery threshold, i.e., the required number of successful workers.
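A sketch of the MatDot-style idea behind low recovery thresholds (an assumed illustrative construction, not necessarily this paper's exact code): with m = 2 blocks per matrix and 4 workers, any 2m - 1 = 3 returned products suffice to recover the product by interpolation.

```python
# MatDot-style coded matrix multiplication: A*B is split into m = 2 inner
# products; each worker multiplies two coded blocks; the x^1 coefficient of the
# degree-2 product polynomial equals A*B, so any 3 of 4 workers suffice.
import numpy as np

m = 2
A = np.random.randn(4, 6)
B = np.random.randn(6, 4)
A_blocks = np.split(A, m, axis=1)   # column blocks A1, A2
B_blocks = np.split(B, m, axis=0)   # row blocks    B1, B2

# Encoding polynomials: pA(x) = A1 + A2*x, pB(x) = B1*x + B2, so the x^1
# coefficient of pA(x) @ pB(x) equals A1*B1 + A2*B2 = A*B.
def pA(x): return A_blocks[0] + A_blocks[1] * x
def pB(x): return B_blocks[0] * x + B_blocks[1]

points = [1.0, 2.0, 3.0, 4.0]       # one evaluation point per worker
results = {x: pA(x) @ pB(x) for x in points}

# Suppose the worker at x = 4 straggles: interpolate the degree-2 matrix
# polynomial from the 3 fastest results and read off the x^1 coefficient.
fast = [1.0, 2.0, 3.0]
V = np.vander(fast, 3, increasing=True)                    # rows: [1, x, x^2]
stacked = np.stack([results[x] for x in fast])             # shape (3, 4, 4)
coeffs = np.tensordot(np.linalg.inv(V), stacked, axes=1)   # polynomial coefficients
assert np.allclose(coeffs[1], A @ B)
```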
References
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
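The programming model is easy to convey with the standard word-count toy (illustrative Python, not Google's implementation): the user writes only map and reduce functions, and the runtime handles partitioning, grouping, and scheduling.

```python
# Word-count in the MapReduce programming model: map emits (key, value) pairs,
# the shuffle groups values by key, and reduce aggregates each group.
from collections import defaultdict

def map_fn(document: str):
    for word in document.split():
        yield word, 1                  # emit (key, value) pairs

def reduce_fn(word: str, counts):
    return word, sum(counts)           # aggregate all values for one key

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": group intermediate values by key, as the runtime would.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        groups[word].append(count)

print(dict(reduce_fn(w, c) for w, c in groups.items()))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```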
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Journal ArticleDOI

Network information flow

TL;DR: This work reveals that it is in general not optimal to regard the information to be multicast as a "fluid" which can simply be routed or replicated, and that, by employing coding at the nodes (which the work refers to as network coding), bandwidth can in general be saved.
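The classic butterfly-network example (illustrative only, not the paper's general result) shows the idea: the bottleneck node forwards the XOR of two packets, and each sink cancels the packet it already received directly.

```python
# Butterfly-network toy example: one XOR sent over the shared bottleneck edge
# lets both sinks recover both source packets.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

b1, b2 = b"packet-1", b"packet-2"   # two source packets to multicast to both sinks

coded = xor(b1, b2)                 # sent once over the bottleneck edge

# Sink 1 receives b1 directly plus the coded packet; sink 2 receives b2 directly.
assert xor(coded, b1) == b2         # sink 1 recovers b2
assert xor(coded, b2) == b1         # sink 2 recovers b1
```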
Proceedings Article

Spark: cluster computing with working sets

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings ArticleDOI

Fog computing and its role in the internet of things

TL;DR: This paper argues that the above characteristics make the Fog the appropriate platform for a number of critical Internet of Things services and applications, namely, Connected Vehicle, Smart Grid, Smart Cities, and, in general, Wireless Sensors and Actuators Networks (WSANs).