Open Access Journal Article (DOI)

A Fundamental Tradeoff Between Computation and Communication in Distributed Computing

TL;DR
A coded scheme, named “coded distributed computing” (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of r can create novel coding opportunities that reduce the communication load by the same factor.
Abstract
How can we optimally trade extra computing power to reduce the communication load in distributed computing? We answer this question by characterizing a fundamental tradeoff between computation and communication in distributed computing, i.e., the two are inversely proportional to each other. More specifically, a general distributed computing framework, motivated by commonly used structures like MapReduce, is considered, where the overall computation is decomposed into computing a set of “Map” and “Reduce” functions distributedly across multiple computing nodes. A coded scheme, named “coded distributed computing” (CDC), is proposed to demonstrate that increasing the computation load of the Map functions by a factor of $r$ (i.e., evaluating each function at $r$ carefully chosen nodes) can create novel coding opportunities that reduce the communication load by the same factor. An information-theoretic lower bound on the communication load is also provided, which matches the communication load achieved by the CDC scheme. As a result, the optimal computation-communication tradeoff in distributed computing is exactly characterized. Finally, the coding techniques of CDC are applied to the Hadoop TeraSort benchmark to develop a novel CodedTeraSort algorithm, which is empirically demonstrated to speed up the overall job execution by $1.97\times$–$3.39\times$, for typical settings of interest.
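The tradeoff the abstract describes is simple enough to tabulate. The sketch below assumes K homogeneous nodes and the load definitions from the paper; the closed form L*(r) = (1/r)(1 − r/K) for the optimal communication load is the paper's main result, quoted here from memory:

```python
# Minimal sketch of the computation-communication tradeoff characterized by CDC:
# with K nodes and computation load r, the paper shows the optimal communication
# load is (1/r) * (1 - r/K), an r-fold improvement over the uncoded load 1 - r/K.

def uncoded_load(r: int, K: int) -> float:
    """Communication load with r-fold repetition but no coding."""
    return 1 - r / K

def cdc_load(r: int, K: int) -> float:
    """Optimal communication load achieved by coded distributed computing."""
    return (1 / r) * (1 - r / K)

K = 10
for r in range(1, K + 1):
    print(f"r={r:2d}  uncoded={uncoded_load(r, K):.3f}  cdc={cdc_load(r, K):.3f}")
```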


Citations
Proceedings Article

Gradient Coding: Avoiding Stragglers in Distributed Learning

TL;DR: This work proposes a novel coding-theoretic framework for mitigating stragglers in distributed learning and shows how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for synchronous gradient descent.
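To make the idea concrete, here is a toy instance in the spirit of gradient coding: 3 workers, 3 data blocks, and tolerance to any single straggler. The encoding coefficients below are an illustrative choice, not necessarily the paper's construction:

```python
import numpy as np

# Each worker holds 2 of 3 data blocks and sends one coded combination of its
# local per-block gradients; the full gradient is then a fixed linear
# combination of the messages from ANY 2 of the 3 workers.
rng = np.random.default_rng(0)
g1, g2, g3 = (rng.standard_normal(4) for _ in range(3))  # per-block gradients
target = g1 + g2 + g3                                    # full gradient

w1 = 0.5 * g1 + g2        # worker 1 holds blocks {1, 2}
w2 = g2 - g3              # worker 2 holds blocks {2, 3}
w3 = 0.5 * g1 + g3        # worker 3 holds blocks {1, 3}

# Decoding coefficients for every 2-worker subset.
decoders = {(1, 2): (2, -1), (1, 3): (1, 1), (2, 3): (1, 2)}
msgs = {1: w1, 2: w2, 3: w3}
for (i, j), (a, b) in decoders.items():
    assert np.allclose(a * msgs[i] + b * msgs[j], target)
print("full gradient recovered from every 2-of-3 subset")
```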
Proceedings Article

Polynomial codes: an optimal design for high-dimensional coded matrix multiplication

TL;DR: This work considers a large-scale matrix multiplication problem where the computation is carried out by a distributed system with a master node and multiple worker nodes, each of which can store parts of the input matrices. It proposes a computation strategy that leverages ideas from coding theory to design the intermediate computations at the worker nodes, efficiently dealing with straggling workers.
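A minimal sketch of the polynomial-code idea, with each input matrix split in two so that the results of any 4 of 5 workers suffice; the evaluation points and block sizes are arbitrary choices for illustration:

```python
import numpy as np

# Split A into row-blocks A0, A1 and B into column-blocks B0, B1. Worker i
# evaluates P(x_i) = (A0 + A1*x_i) @ (B0 + B1*x_i**2), a degree-3 matrix
# polynomial whose coefficients are exactly the blocks Aj @ Bk, so C = A @ B
# is recovered by interpolating from ANY 4 evaluations (1 straggler tolerated).
rng = np.random.default_rng(1)
A0, A1 = rng.standard_normal((2, 4, 6))
B0, B1 = rng.standard_normal((2, 6, 3))

xs = np.arange(1.0, 6.0)                       # evaluation points for 5 workers
results = [(A0 + A1 * x) @ (B0 + B1 * x**2) for x in xs]

# Master: pretend worker 2 straggles; interpolate entrywise from the other 4.
alive = [0, 1, 3, 4]
V = np.vander(xs[alive], 4, increasing=True)   # Vandermonde for a degree-3 poly
stacked = np.stack([results[i].ravel() for i in alive])
coeffs = np.linalg.solve(V, stacked)           # rows: A0B0, A1B0, A0B1, A1B1

C_top = np.hstack([coeffs[0].reshape(4, 3), coeffs[2].reshape(4, 3)])
C_bot = np.hstack([coeffs[1].reshape(4, 3), coeffs[3].reshape(4, 3)])
C = np.vstack([C_top, C_bot])
assert np.allclose(C, np.vstack([A0, A1]) @ np.hstack([B0, B1]))
print("C recovered from 4 of 5 workers")
```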
Journal Article (DOI)

The Exact Rate-Memory Tradeoff for Caching With Uncoded Prefetching

TL;DR: A novel caching scheme is proposed, which strictly improves the state of the art by exploiting commonality among user demands; the rate-memory tradeoff is also fully characterized for a decentralized setting, in which users fill their cache content without any coordination.
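The closed-form tradeoff can be evaluated directly. The sketch below assumes K users, N files, and cache size M = tN/K for integer t, with the rate formula as commonly stated for this result (quoted from memory, so treat it as a sketch rather than a definitive statement):

```python
from math import comb

# Exact rate-memory tradeoff under uncoded prefetching:
#   R*(t) = [C(K, t+1) - C(K - min(N, K), t+1)] / C(K, t),
# which reduces to (K - t)/(t + 1) when N >= K.
def rate(K: int, N: int, t: int) -> float:
    return (comb(K, t + 1) - comb(K - min(N, K), t + 1)) / comb(K, t)

K, N = 10, 20
for t in range(K + 1):
    print(f"M = {t * N / K:5.1f}  R* = {rate(K, N, t):.3f}")
```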
Journal Article (DOI)

The Role of Caching in Future Communication Systems and Networks

TL;DR: Caching has been studied for more than 40 years and has recently received increased attention from industry and academia; this survey aims to convince the reader that content caching is an exciting research topic for future communication systems and networks.
References
Journal Article (DOI)

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets that runs on large clusters of commodity machines and is highly scalable.
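The programming model is easy to illustrate with the canonical word-count example, here as a single-process Python sketch (the real system distributes the map, shuffle, and reduce phases across a cluster):

```python
from collections import defaultdict
from itertools import chain

# Word count in the MapReduce model: map emits (word, 1) pairs, the framework
# groups them by key, and reduce sums the counts per word.
def map_fn(line: str):
    for word in line.split():
        yield word, 1

def reduce_fn(word: str, counts):
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle/group step, normally done by the framework.
groups = defaultdict(list)
for word, one in chain.from_iterable(map_fn(l) for l in lines):
    groups[word].append(one)

print(dict(reduce_fn(w, c) for w, c in groups.items()))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```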
Journal Article (DOI)

Network information flow

TL;DR: This work reveals that it is in general not optimal to regard the information to be multicast as a "fluid" which can simply be routed or replicated, and that, by employing coding at the nodes, which the work refers to as network coding, bandwidth can in general be saved.
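The classic butterfly-network example shows the gain: with unit-capacity edges, plain routing cannot deliver both source bits to both sinks in a single network use, but XOR-ing at the bottleneck node can:

```python
# Butterfly-network example of network coding: two source bits b1, b2 must be
# multicast to two sinks. The shared middle edge forwards b1 XOR b2, and each
# sink combines it with the bit it receives directly.
b1, b2 = 1, 0

coded = b1 ^ b2           # bottleneck node forwards b1 XOR b2 to both sinks

sink1 = (b1, coded ^ b1)  # sink 1 gets b1 directly, recovers b2 from the XOR
sink2 = (coded ^ b2, b2)  # sink 2 gets b2 directly, recovers b1 from the XOR
assert sink1 == (b1, b2) and sink2 == (b1, b2)
print("both sinks recover both bits:", sink1, sink2)
```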
Journal Article (DOI)

The Google file system

TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Proceedings Article

Spark: cluster computing with working sets

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
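A minimal PySpark sketch of the working-set idea (assumes a local Spark installation; the dataset size and iteration count are arbitrary): caching keeps the parsed dataset in memory, so the repeated passes of an iterative job avoid re-reading it each time:

```python
from pyspark import SparkContext

# Cache the working set once, then reuse it across iterations.
sc = SparkContext("local[*]", "working-set-demo")
points = sc.parallelize(range(1_000_000)).map(lambda x: (x % 10, x)).cache()

for _ in range(5):          # iterative job reuses the cached RDD
    total = points.values().sum()
print(total)
sc.stop()
```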
Proceedings Article (DOI)

Fog computing and its role in the internet of things

TL;DR: This paper argues that the above characteristics make the Fog the appropriate platform for a number of critical Internet of Things services and applications, namely, Connected Vehicle, Smart Grid, Smart Cities, and, in general, Wireless Sensors and Actuators Networks (WSANs).