Proceedings Article

Scheduling in mapreduce-like systems for fast completion time

TLDR
This paper devises various online and offline algorithms that order jobs to minimize overall job completion times, and proposes approximation algorithms that come within a factor of 3 of the optimal.
Abstract
Large-scale data processing needs of enterprises today are primarily met with distributed and parallel computing in data centers. MapReduce has emerged as an important programming model for these environments. Since today's data centers run many MapReduce jobs in parallel, it is important to find a good scheduling algorithm that can optimize the completion times of these jobs. While several recent papers have focused on optimizing the scheduler, there is very little theoretical understanding of the scheduling problem in the context of MapReduce. In this paper, we address this gap by first presenting a simplified abstraction of the MapReduce scheduling problem and then formulating it as an optimization problem. We devise various online and offline algorithms to arrive at a good ordering of jobs that minimizes the overall job completion times. Since computing optimal solutions is NP-hard, we propose approximation algorithms that come within a factor of 3 of the optimal. Using simulations, we also compare our online algorithm with standard scheduling strategies such as FIFO and Shortest Job First, and show that our algorithm consistently outperforms them across different job distributions.
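To illustrate why job ordering matters for completion time, the following minimal sketch compares the two baseline strategies named in the abstract, FIFO and Shortest Job First. It is not the paper's algorithm or its MapReduce abstraction: it assumes a single machine, jobs that are all available at time zero, and no separate map and reduce phases. The function names and the job durations are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def total_completion_time(durations):
    """Sum of completion times when jobs run back to back on one machine.

    Assumption: every job is available at time 0 and runs without preemption.
    """
    return sum(accumulate(durations))


def fifo(jobs):
    """FIFO baseline: keep the jobs in arrival (list) order."""
    return list(jobs)


def shortest_job_first(jobs):
    """Shortest Job First baseline: run the shortest job first."""
    return sorted(jobs)


# Hypothetical job durations, listed in arrival order.
jobs = [8, 1, 3]

fifo_total = total_completion_time(fifo(jobs))               # 8 + 9 + 12
sjf_total = total_completion_time(shortest_job_first(jobs))  # 1 + 4 + 12

print("FIFO:", fifo_total)
print("SJF: ", sjf_total)
```

In this toy instance, FIFO yields a total completion time of 29 and Shortest Job First yields 17, because running the long job first delays every job behind it. The setting studied in the paper is harder than this sketch and, per the abstract, NP-hard, which is why the authors propose approximation algorithms.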


Citations
Journal Article

Energy-Aware Scheduling of MapReduce Jobs for Big Data Applications

TL;DR: This paper proposes two heuristic algorithms, called energy-aware MapReduce scheduling algorithms (EMRSA-I and EMRSA-II), that find assignments of map and reduce tasks to machine slots in order to minimize the energy consumed when executing the application.
Proceedings Article

Joint scheduling of processing and Shuffle phases in MapReduce systems

TL;DR: This paper considers the problem of jointly scheduling all three phases of the MapReduce process with a view of understanding the theoretical complexity of the joint scheduling and working towards practical heuristics for scheduling the tasks.
Patent

Resource aware scheduling in a distributed computing environment

TL;DR: This patent presents a system and methods for resource-aware scheduling of processes in a distributed computing environment, including a comparison of the current reward value and the prospective reward value.
Journal Article

From the Cloud to the Atmosphere: Running MapReduce across Data Centers

TL;DR: G-MR is introduced, a system for executing sequences of MapReduce jobs on geo-distributed data sets, which implements the optimization framework; evaluations show that using G-MR significantly improves processing time and cost for geo-distributed data sets.
Journal Article

Budget-Driven Scheduling Algorithms for Batches of MapReduce Jobs in Heterogeneous Clouds

TL;DR: Two greedy algorithms, called Global Greedy Budget and Gradual Refinement, are developed; the results show the cost-effectiveness of these algorithms in distributing the budget for performance optimization of MapReduce workflows.
References
Journal Article

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal Article

MapReduce: simplified data processing on large clusters

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Proceedings Article

Improving MapReduce performance in heterogeneous environments

TL;DR: This paper proposes a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
Proceedings Article

Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

TL;DR: This work proposes a simple algorithm called delay scheduling, which achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness.
Proceedings Article

Quincy: fair scheduling for distributed computing clusters

TL;DR: It is argued that data-intensive computation benefits from a fine-grain resource sharing model that differs from the coarser semi-static resource allocations implemented by most existing cluster computing architectures.