Scheduling in MapReduce-like systems for fast completion time

doi:10.1109/INFCOM.2011.5935152

Proceedings Article

Hyunseok Chang, +5 more

- pp 3074-3082

TLDR

This paper devises various online and offline algorithms to arrive at a good ordering of jobs that minimizes overall job completion times, and proposes approximation algorithms that work within a factor of 3 of the optimal.

Abstract:

Large-scale data processing needs of enterprises today are primarily met with distributed and parallel computing in data centers. MapReduce has emerged as an important programming model for these environments. Since today's data centers run many MapReduce jobs in parallel, it is important to find a good scheduling algorithm that can optimize the completion times of these jobs. While several recent papers have focused on optimizing the scheduler, there exists very little theoretical understanding of the scheduling problem in the context of MapReduce. In this paper, we seek to address this problem by first presenting a simplified abstraction of the MapReduce scheduling problem, and then formulating the scheduling problem as an optimization problem. We devise various online and offline algorithms to arrive at a good ordering of jobs to minimize the overall job completion times. Since optimal solutions are hard to compute (NP-hard), we propose approximation algorithms that work within a factor of 3 of the optimal. Using simulations, we also compare our online algorithm with standard scheduling strategies such as FIFO and Shortest Job First, and show that our algorithm consistently outperforms these across different job distributions.
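
To make the two baseline strategies named in the abstract concrete, the following minimal sketch compares FIFO and Shortest Job First on the sum of job completion times. It assumes a single resource that runs jobs back to back, which is much simpler than the map/reduce abstraction the paper studies, and it is not the paper's algorithm. The function names and the job durations are hypothetical, chosen only for illustration.

```python
from itertools import accumulate


def total_completion_time(durations):
    """Sum of completion times when jobs run back to back in the given order."""
    # accumulate() yields each job's finish time; the objective is their sum.
    return sum(accumulate(durations))


def fifo(jobs):
    """FIFO baseline: keep the arrival order unchanged."""
    return list(jobs)


def shortest_job_first(jobs):
    """SJF baseline: run the shortest job first (all jobs assumed available at time 0)."""
    return sorted(jobs)


# Hypothetical job durations, listed in arrival order.
jobs = [8, 1, 3]

fifo_total = total_completion_time(fifo(jobs))               # 8 + 9 + 12 = 29
sjf_total = total_completion_time(shortest_job_first(jobs))  # 1 + 4 + 12 = 17

print(f"FIFO: {fifo_total}, SJF: {sjf_total}")
```

In this single-resource setting, running short jobs first lowers the total because every job queued behind a long job inherits its delay. The setting the abstract describes is harder (the problem is stated to be NP-hard), which is why the paper turns to approximation algorithms.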

Citations

Journal Article

Energy-Aware Scheduling of MapReduce Jobs for Big Data Applications

Lena Mashayekhy, +4 more

- 01 Oct 2015 -

IEEE Transactions on Parallel and Distri...

TL;DR: This paper proposes two heuristic algorithms, called energy-aware MapReduce scheduling algorithms (EMRSA-I and EMRSA-II), that find the assignments of map and reduce tasks to the machine slots in order to minimize the energy consumed when executing the application.

Proceedings Article

Joint scheduling of processing and Shuffle phases in MapReduce systems

Fangfei Chen, +2 more

TL;DR: This paper considers the problem of jointly scheduling all three phases of the MapReduce process with a view of understanding the theoretical complexity of the joint scheduling and working towards practical heuristics for scheduling the tasks.

Patent

Resource aware scheduling in a distributed computing environment

Xiaoqiao Meng, +2 more

TL;DR: This patent describes a system and methods for resource-aware scheduling of processes in a distributed computing environment, which involve comparing a current reward value with a prospective reward value.

Journal Article

From the Cloud to the Atmosphere: Running MapReduce across Data Centers

Chamikara Jayalath, +2 more

- 01 Jan 2014 -

IEEE Transactions on Computers

TL;DR: This paper introduces G-MR, a system for executing sequences of MapReduce jobs on geo-distributed data sets, which implements the optimization framework; evaluations show that using G-MR significantly improves processing time and cost for geo-distributed data sets.

Journal Article

Budget-Driven Scheduling Algorithms for Batches of MapReduce Jobs in Heterogeneous Clouds

Yang Wang, +1 more

TL;DR: Two greedy algorithms, called Global Greedy Budget and Gradual Refinement, are developed and shown to distribute the budget cost-effectively for performance optimization of MapReduce workflows.

References

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008 -

Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Proceedings Article

Improving MapReduce performance in heterogeneous environments

Matei Zaharia, +4 more

TL;DR: This paper proposes a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.

Proceedings Article

Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Matei Zaharia, +5 more

TL;DR: This work proposes a simple algorithm called delay scheduling, which achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness.

Proceedings Article

Quincy: fair scheduling for distributed computing clusters

Michael Isard, +5 more

TL;DR: It is argued that data-intensive computation benefits from a fine-grain resource sharing model that differs from the coarser semi-static resource allocations implemented by most existing cluster computing architectures.

Related Papers (5)

MapReduce: simplified data processing on large clusters

Improving MapReduce performance in heterogeneous environments

Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Scheduling Hadoop Jobs to Meet Deadlines

ARIA: automatic resource inference and allocation for mapreduce environments