Open Access · Posted Content

A Fundamental Tradeoff between Computation and Communication in Distributed Computing

TLDR
In this article, a coded distributed computing (CDC) scheme is proposed to reduce the communication load in distributed computing, where the overall computation is decomposed into computing a set of Map and Reduce functions distributedly across multiple computing nodes.
Abstract
How can we optimally trade extra computing power to reduce the communication load in distributed computing? We answer this question by characterizing a fundamental tradeoff between computation and communication in distributed computing, i.e., the two are inversely proportional to each other. More specifically, a general distributed computing framework, motivated by commonly used structures like MapReduce, is considered, where the overall computation is decomposed into computing a set of "Map" and "Reduce" functions distributedly across multiple computing nodes. A coded scheme, named "Coded Distributed Computing" (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of $r$ (i.e., evaluating each function at $r$ carefully chosen nodes) can create novel coding opportunities that reduce the communication load by the same factor. An information-theoretic lower bound on the communication load is also provided, which matches the communication load achieved by the CDC scheme. As a result, the optimal computation-communication tradeoff in distributed computing is exactly characterized. Finally, the coding techniques of CDC are applied to the Hadoop TeraSort benchmark to develop a novel CodedTeraSort algorithm, which is empirically demonstrated to speed up the overall job execution by $1.97\times$ - $3.39\times$, for typical settings of interest.
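To make the factor-of-$r$ gain concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation) of the paper's canonical example with K = 3 nodes, N = 6 input files, and computation load r = 2: every file is mapped at two nodes, and one coded XOR broadcast per node replaces two uncoded unicasts in the shuffle.

```python
# CDC toy example: K = 3 nodes, N = 6 files, computation load r = 2.
# Three coded broadcasts deliver the six intermediate values that would
# otherwise require six uncoded unicasts, i.e., the load drops by r = 2.
import hashlib

K, N = 3, 6
files = {n: f"file-{n}".encode() for n in range(1, N + 1)}

# File placement: each file is stored (and mapped) at exactly r = 2 nodes.
placement = {1: {1, 2, 3, 4}, 2: {1, 2, 5, 6}, 3: {3, 4, 5, 6}}

def map_fn(q: int, data: bytes) -> bytes:
    """Intermediate value v_{q,n}: the piece of file n needed by reduce task q."""
    return hashlib.sha256(bytes([q]) + data).digest()

# Map phase: node k computes v_{q,n} for every file it stores and every q.
v = {k: {(q, n): map_fn(q, files[n]) for n in placement[k] for q in range(1, K + 1)}
     for k in range(1, K + 1)}

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Shuffle phase: each node broadcasts one XOR of two intermediate values, each
# wanted by one of the other two nodes and already computable by the third.
broadcasts = {
    1: xor(v[1][(2, 3)], v[1][(3, 1)]),  # node 2 wants v_{2,3}, node 3 wants v_{3,1}
    2: xor(v[2][(1, 5)], v[2][(3, 2)]),  # node 1 wants v_{1,5}, node 3 wants v_{3,2}
    3: xor(v[3][(1, 6)], v[3][(2, 4)]),  # node 1 wants v_{1,6}, node 2 wants v_{2,4}
}

# Decoding at node 1: cancel the interfering value it already computed locally.
assert xor(broadcasts[2], v[1][(3, 2)]) == map_fn(1, files[5])  # recovers v_{1,5}
assert xor(broadcasts[3], v[1][(2, 4)]) == map_fn(1, files[6])  # recovers v_{1,6}
print("3 coded broadcasts replace 6 uncoded unicasts: load reduced by r = 2")
```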


Citations
Proceedings Article

Gradient Coding: Avoiding Stragglers in Distributed Learning

TL;DR: This work proposes a novel coding theoretic framework for mitigating stragglers in distributed learning and shows how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous Gradient Descent.
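The straggler-tolerance mechanism can be seen in the 3-worker, 1-straggler example from the gradient coding paper; the NumPy sketch below is illustrative only and shows how replicating each data partition on two workers and sending coded combinations lets the master recover the full gradient from any two workers.

```python
# Gradient coding toy example: 3 workers, each holding 2 of 3 data partitions,
# each sending one coded combination; any 2 messages recover the full gradient.
import numpy as np

g1, g2, g3 = (np.random.randn(4) for _ in range(3))  # partial gradients of 3 partitions
full = g1 + g2 + g3

w1 = 0.5 * g1 + g2   # worker 1 holds partitions {1, 2}
w2 = g2 - g3         # worker 2 holds partitions {2, 3}
w3 = 0.5 * g1 + g3   # worker 3 holds partitions {1, 3}

# Any 2 of the 3 coded messages suffice (the third worker may straggle):
assert np.allclose(2 * w1 - w2, full)   # workers {1, 2}
assert np.allclose(w1 + w3, full)       # workers {1, 3}
assert np.allclose(w2 + 2 * w3, full)   # workers {2, 3}
```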
Journal ArticleDOI

The Exact Rate-Memory Tradeoff for Caching With Uncoded Prefetching

TL;DR: A novel caching scheme is proposed, which strictly improves the state of the art by exploiting commonality among user demands, and the rate-memory tradeoff is fully characterized for a decentralized setting in which users fill their cache content without any coordination.
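The gain from coded delivery over uncoded prefetching is easiest to see in the classic two-user, two-file, cache-size-one example; the sketch below uses that toy case only and is not the paper's general scheme.

```python
# Coded caching toy example with uncoded prefetching: one XOR broadcast serves
# both users' demands, halving the delivery rate compared to uncoded delivery.
A1, A2 = b"A-half-1", b"A-half-2"   # file A split into two halves
B1, B2 = b"B-half-1", b"B-half-2"   # file B split into two halves

cache_user1 = {"A1": A1, "B1": B1}  # uncoded prefetching: user 1 caches first halves
cache_user2 = {"A2": A2, "B2": B2}  # user 2 caches second halves

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Demands: user 1 wants file A, user 2 wants file B.
broadcast = xor(A2, B1)             # a single coded transmission

assert xor(broadcast, cache_user1["B1"]) == A2  # user 1 recovers its missing half
assert xor(broadcast, cache_user2["A2"]) == B1  # user 2 recovers its missing half
```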
Journal ArticleDOI

Speeding Up Distributed Machine Learning Using Codes

TL;DR: In this paper, the authors provide theoretical insights on how coded solutions can achieve significant gains compared to uncoded ones for matrix multiplication and data shuffling in large-scale distributed systems.
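As an illustration of the matrix-multiplication part, the sketch below shows a (3, 2) MDS-coded matrix-vector multiply in the spirit of the paper (an assumed toy setup, not its exact construction): one redundant worker makes the result recoverable from any two of three workers.

```python
# MDS-coded matrix-vector multiplication: split A into two row blocks, add one
# "parity" worker computing (A1 + A2) x; any 2 of 3 worker results recover A x.
import numpy as np

A = np.random.randn(6, 4)
x = np.random.randn(4)
A1, A2 = A[:3], A[3:]                 # row blocks for the two systematic workers

y1, y2, y3 = A1 @ x, A2 @ x, (A1 + A2) @ x   # the three workers' tasks

# Suppose worker 2 straggles: recover its block from the coded result.
A2x_recovered = y3 - y1
assert np.allclose(np.concatenate([y1, A2x_recovered]), A @ x)
```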
Journal ArticleDOI

The Role of Caching in Future Communication Systems and Networks

TL;DR: Caching has been studied for more than 40 years and has recently received increased attention from industry and academia; this survey aims to convince the reader that content caching is an exciting research topic for future communication systems and networks.
Journal ArticleDOI

On the Optimal Recovery Threshold of Coded Matrix Multiplication

TL;DR: Novel coded computation strategies for distributed matrix-matrix products are provided that outperform the recent "Polynomial code" constructions in recovery threshold, i.e., the required number of successful workers.
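A sketch of the MatDot-style idea behind low recovery thresholds (an assumed illustrative construction, not necessarily this paper's exact code): with m = 2 blocks per matrix and 4 workers, any 2m - 1 = 3 returned products suffice to recover the product by interpolation.

```python
# MatDot-style coded matrix multiplication: A*B is split into m = 2 inner
# products; each worker multiplies two coded blocks; the x^1 coefficient of the
# degree-2 product polynomial equals A*B, so any 3 of 4 workers suffice.
import numpy as np

m = 2
A = np.random.randn(4, 6)
B = np.random.randn(6, 4)
A_blocks = np.split(A, m, axis=1)   # column blocks A1, A2
B_blocks = np.split(B, m, axis=0)   # row blocks    B1, B2

# Encoding polynomials: pA(x) = A1 + A2*x, pB(x) = B1*x + B2, so the x^1
# coefficient of pA(x) @ pB(x) equals A1*B1 + A2*B2 = A*B.
def pA(x): return A_blocks[0] + A_blocks[1] * x
def pB(x): return B_blocks[0] * x + B_blocks[1]

points = [1.0, 2.0, 3.0, 4.0]       # one evaluation point per worker
results = {x: pA(x) @ pB(x) for x in points}

# Suppose the worker at x = 4 straggles: interpolate the degree-2 matrix
# polynomial from the 3 fastest results and read off the x^1 coefficient.
fast = [1.0, 2.0, 3.0]
V = np.vander(fast, 3, increasing=True)                    # rows: [1, x, x^2]
stacked = np.stack([results[x] for x in fast])             # shape (3, 4, 4)
coeffs = np.tensordot(np.linalg.inv(V), stacked, axes=1)   # polynomial coefficients
assert np.allclose(coeffs[1], A @ B)
```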
References
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
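The programming model is easy to convey with the standard word-count toy (illustrative Python, not Google's implementation): the user writes only map and reduce functions, and the runtime handles partitioning, grouping, and scheduling.

```python
# Word-count in the MapReduce programming model: map emits (key, value) pairs,
# the shuffle groups values by key, and reduce aggregates each group.
from collections import defaultdict

def map_fn(document: str):
    for word in document.split():
        yield word, 1                  # emit (key, value) pairs

def reduce_fn(word: str, counts):
    return word, sum(counts)           # aggregate all values for one key

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": group intermediate values by key, as the runtime would.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        groups[word].append(count)

print(dict(reduce_fn(w, c) for w, c in groups.items()))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```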
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Journal ArticleDOI

Network information flow

TL;DR: This work reveals that it is in general not optimal to regard the information to be multicast as a "fluid" which can simply be routed or replicated, and that, by employing coding at the nodes (which the work refers to as network coding), bandwidth can in general be saved.
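The classic butterfly-network example (illustrative only, not the paper's general result) shows the idea: the bottleneck node forwards the XOR of two packets, and each sink cancels the packet it already received directly.

```python
# Butterfly-network toy example: one XOR sent over the shared bottleneck edge
# lets both sinks recover both source packets.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

b1, b2 = b"packet-1", b"packet-2"   # two source packets to multicast to both sinks

coded = xor(b1, b2)                 # sent once over the bottleneck edge

# Sink 1 receives b1 directly plus the coded packet; sink 2 receives b2 directly.
assert xor(coded, b1) == b2         # sink 1 recovers b2
assert xor(coded, b2) == b1         # sink 2 recovers b1
```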
Proceedings Article

Spark: cluster computing with working sets

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings ArticleDOI

Fog computing and its role in the internet of things

TL;DR: This paper argues that the above characteristics make the Fog the appropriate platform for a number of critical Internet of Things services and applications, namely, Connected Vehicle, Smart Grid, Smart Cities, and, in general, Wireless Sensors and Actuators Networks (WSANs).