Evaluating MapReduce for Multi-core and Multiprocessor Systems
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis
Computer Systems Laboratory
Stanford University
Email addresses: {cranger, ramananr, penmetsa}@stanford.edu, garybradski@gmail.com, and christos@ee.stanford.edu.
Abstract
This paper evaluates the suitability of the MapReduce
model for multi-core and multi-processor systems. MapRe-
duce was created by Google for application development
on data-centers with thousands of servers. It allows pro-
grammers to write functional-style code that is automati-
cally parallelized and scheduled in a distributed system.
We describe Phoenix, an implementation of MapReduce
for shared-memory systems that includes a programming
API and an efficient runtime system. The Phoenix run-
time automatically manages thread creation, dynamic task
scheduling, data partitioning, and fault tolerance across
processor nodes. We study Phoenix with multi-core and
symmetric multiprocessor systems and evaluate its perfor-
mance potential and error recovery features. We also com-
pare MapReduce code to code written in lower-level APIs
such as P-threads. Overall, we establish that, given a care-
ful implementation, MapReduce is a promising model for
scalable performance on shared-memory systems with sim-
ple parallel code.
1 Introduction
As multi-core chips become ubiquitous, we need parallel
programs that can exploit more than one processor. Tradi-
tional parallel programming techniques, such as message-
passing and shared-memory threads, are too cumbersome
for most developers. They require that the programmer
manages concurrency explicitly by creating threads and
synchronizing them through messages or locks. They also
require manual management of data locality. Hence, it is
very difficult to write correct and scalable parallel code for
non-trivial algorithms. Moreover, the programmer must of-
ten re-tune the code when the application is ported to a dif-
ferent or larger-scale system.
To simplify parallel coding, we need to develop two com-
ponents: a practical programming model that allows users
to specify concurrency and locality at a high level and an
efficient runtime system that handles low-level mapping, re-
source management, and fault tolerance issues automati-
cally regardless of the system characteristics or scale. Nat-
urally, the two components are closely linked. Recently,
there has been a significant body of research towards these
goals using approaches such as streaming [13, 15], mem-
ory transactions [14, 5], data-flow based schemes [2], asyn-
chronous parallelism, and partitioned global address space
languages [6, 1, 7].
This paper presents Phoenix, a programming API and
runtime system based on Google’s MapReduce model [8].
MapReduce borrows two concepts from functional lan-
guages to express data-intensive algorithms. The Map func-
tion processes the input data and generates a set of interme-
diate key/value pairs. The Reduce function properly merges
the intermediate pairs which have the same key. Given such
a functional specification, the MapReduce runtime automat-
ically parallelizes the computation by running multiple map
and/or reduce tasks in parallel over disjoint portions of
the input or intermediate data. Google’s MapReduce im-
plementation facilitates processing of terabytes on clusters
with thousands of nodes. The Phoenix implementation is
based on the same principles but targets shared-memory
systems such as multi-core chips and symmetric multipro-
cessors.
Phoenix uses threads to spawn parallel Map or Reduce
tasks. It also uses shared-memory buffers to facilitate com-
munication without excessive data copying. The runtime
schedules tasks dynamically across the available processors
in order to achieve load balance and maximize task through-
put. Locality is managed by adjusting the granularity and
assignment of parallel tasks. The runtime automatically re-
covers from transient and permanent faults during task exe-
cution by repeating or re-assigning tasks and properly merg-
ing their output with that from the rest of the computation.
Overall, the Phoenix runtime handles the complicated con-
currency, locality, and fault-tolerance tradeoffs that make
parallel programming difficult. Nevertheless, it also allows
the programmer to provide application specific knowledge
such as custom data partitioning functions (if desired).
We evaluate Phoenix on commercial multi-core and multiprocessor systems and demonstrate that it leads to scal-
able performance in both environments. Through fault in-
jection experiments, we show that Phoenix can handle per-
manent and transient faults during Map and Reduce tasks
at a small performance penalty. Finally, we compare the
performance of Phoenix code to tuned parallel code written
directly with P-threads. Despite the overheads associated
with the MapReduce model, Phoenix provides similar per-
formance for many applications. Nevertheless, the stylized
key management and additional data copying in MapRe-
duce lead to significant performance losses for some ap-
plications. Overall, even though MapReduce may not be
applicable to all algorithms, it can be a valuable tool for
simple parallel programming and resource management on
shared-memory systems.
The rest of the paper is organized as follows. Section
2 provides an overview of MapReduce, while Section 3
presents our shared-memory implementation. Section 4 de-
scribes our evaluation methodology and Section 5 presents
the evaluation results. Section 6 reviews related work and
Section 7 concludes the paper.
2 MapReduce Overview
This section summarizes the basic principles of the
MapReduce model.
2.1 Programming Model
The MapReduce programming model is inspired by func-
tional languages and targets data-intensive computations.
The input data format is application-specific, and is spec-
ified by the user. The output is a set of <key,value>
pairs. The user expresses an algorithm using two functions,
Map and Reduce. The Map function is applied on the in-
put data and produces a list of intermediate <key,value>
pairs. The Reduce function is applied to all intermediate
pairs with the same key. It typically performs some kind of
merging operation and produces zero or more output pairs.
Finally, the output pairs are sorted by their key value. In
the simplest form of MapReduce programs, the program-
mer provides just the Map function. All other functionality,
including the grouping of the intermediate pairs which have
the same key and the final sorting, is provided by the run-
time.
The following pseudocode shows the basic structure of a
MapReduce program that counts the number of occurrences
of each word in a collection of documents [8]. The map
function emits each word in the documents with the tempo-
rary count 1. The reduce function sums the counts for each
unique word.
// input: a document
// intermediate output: key=word; value=1
Map(void *input) {
    for each word w in input
        EmitIntermediate(w, 1);
}
// intermediate output: key=word; value=1
// output: key=word; value=occurrences
Reduce(String key, Iterator values) {
    int result = 0;
    for each v in values
        result += v;
    Emit(key, result);
}
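For concreteness, here is a small trace of this program on a hypothetical two-document input (the documents and words are purely illustrative):

// documents: d1 = "map reduce map", d2 = "reduce"
// Map(d1) emits: (map,1) (reduce,1) (map,1); Map(d2) emits: (reduce,1)
// grouping by key: map -> [1,1], reduce -> [1,1]
// Reduce output, sorted by key: (map,2) (reduce,2)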
The main benefit of this model is simplicity. The pro-
grammer provides a simple description of the algorithm that
focuses on functionality and not on parallelization. The ac-
tual parallelization and the details of concurrency manage-
ment are left to the runtime system. Hence the program
code is generic and easily portable across systems. Nev-
ertheless, the model provides sufficient high-level informa-
tion for parallelization. The Map function can be executed
in parallel on non-overlapping portions of the input data and
the Reduce function can be executed in parallel on each set
of intermediate pairs with the same key. Similarly, since
it is explicitly known which pairs each function will oper-
ate upon, one can employ prefetching or other scheduling
optimizations for locality.
The critical question is how widely applicable is the
MapReduce model. Dean and Ghemawat provided several
examples of data-intensive problems that were successfully
coded with MapReduce, including a production indexing
system, distributed grep, web-link graph construction, and
statistical machine translation [8]. A recent study by Intel
has also concluded that many data-intensive computations
can be expressed as sums over data points [9]. Such compu-
tations should be a good match for the MapReduce model.
Nevertheless, an extensive evaluation of the applicability
and ease-of-use of the MapReduce model is beyond the
scope of this work. Our goal is to provide an efficient im-
plementation on shared-memory systems that demonstrates
its feasibility and enables programmers to experiment with
this programming approach.
2.2 Runtime System
The MapReduce runtime is responsible for paralleliza-
tion and concurrency control. To parallelize the Map func-
tion, it splits the input pairs into units that are processed
concurrently on multiple nodes. Next, the runtime parti-
tions the intermediate pairs using a scheme that keeps pairs
with the same key in the same unit. The partitions are
processed in parallel by Reduce tasks running on multi-
ple nodes. In both steps, the runtime must decide on fac-
tors such as the size of the units, the number of nodes in-
volved, how units are assigned to nodes dynamically, and
how buffer space is allocated. The decisions can be fully
automatic or guided by the programmer given application-specific knowledge (e.g., number of pairs produced by each
function or the distribution of keys). These decisions allow
the runtime to execute a program efficiently across a wide
range of machines and dataset scenarios without modifica-
tions to the source code. Finally, the runtime must merge
and sort the output pairs from all Reduce tasks.
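As a concrete illustration of this partitioning step, a minimal hash-based scheme sends every pair with a given key to the same Reduce unit. This is only a sketch; Phoenix's default partition function is hash-based (see Table 1), but the particular hash below is an assumption for illustration.

// Sketch: all pairs with equal keys hash to the same Reduce unit, so each
// key is processed by exactly one Reduce task.
int reduce_unit_for_key(const void *key, int key_size, int num_reduce_units) {
    const unsigned char *bytes = (const unsigned char *)key;
    unsigned long hash = 5381;
    for (int i = 0; i < key_size; i++)
        hash = hash * 33 + bytes[i];          /* djb2-style byte hash */
    return (int)(hash % (unsigned long)num_reduce_units);
}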
The runtime can perform several optimizations. It can re-
duce function-call overheads by increasing the granularity
of Map or Reduce tasks. It can also reduce load imbal-
ance by adjusting task granularity or the number of nodes
used. The runtime can also optimize locality in several
ways. First, each node can prefetch pairs for its current
Map or Reduce tasks using hardware or software schemes.
A node can also prefetch the input for its next Map or Re-
duce task while processing the current one, which is simi-
lar to the double-buffering schemes used in streaming mod-
els [23]. Bandwidth and cache space can be preserved using
hardware compression of intermediate pairs which tend to
have high redundancy [10].
The runtime can also assist with fault tolerance. When it
detects that a node has failed, it can re-assign the Map or
Reduce task it was processing at the time to another node.
To avoid interference, the replicated task will use separate
output buffers. If a portion of the memory is corrupted, the
runtime can re-execute just the necessary Map or Reduce
tasks that will re-produce the lost data. It is also possible to
produce a meaningful partial or approximated output even
when some input or intermediate data is permanently lost.
Moreover, the runtime can dynamically adjust the number
of nodes it uses to deal with failures or power and tempera-
ture related issues.
Google’s runtime implementation targets large clusters of
Linux PCs connected through Ethernet switches [3]. Tasks
are forked using remote procedure calls. Buffering and
communication occurs by reading and writing files on a dis-
tributed file system [12]. The locality optimizations focus
mostly on avoiding remote file accesses. While such a sys-
tem is effective with distributed computing [8], it leads to
very high overheads if used with shared-memory systems
that facilitate communication through memory and are typ-
ically of much smaller scale.
The critical question for the runtime is how significant
are the overheads it introduces. The MapReduce model re-
quires that data is associated with keys and that pairs are
handled in a specific manner at each execution step. Hence,
there can be non-trivial overheads due to key management,
data copying, data sorting, or memory allocation between
execution steps. While programmers may be willing to sac-
rifice some of the parallel efficiency in return for a simple
programming model, we must show that the overheads are
not overwhelming.
3 The Phoenix System
Phoenix implements MapReduce for shared-memory
systems. Its goal is to support efficient execution on mul-
tiple cores without burdening the programmer with concur-
rency management. Phoenix consists of a simple API that
is visible to application programmers and an efficient run-
time that handles parallelization, resource management, and
fault recovery.
3.1 The Phoenix API
The current Phoenix implementation provides an
application-programmer interface (API) for C and C++.
However, similar APIs can be defined for languages like
Java or C#. The API includes two sets of functions sum-
marized in Table 1. The first set is provided by Phoenix
and is used by the programmer’s application code to ini-
tialize the system and emit output pairs (1 required and
2 optional functions). The second set includes the func-
tions that the programmer defines (3 required and 2 optional
functions). Apart from the Map and Reduce functions, the
user provides functions that partition the data before each
step and a function that implements key comparison. Note
that the API is quite small compared to other models. The
API is type agnostic. The function arguments are declared
as void pointers wherever possible to provide flexibility in
their declaration and fast use without conversion overhead.
In contrast, the Google implementation uses strings for ar-
guments as string manipulation is inexpensive compared to
remote procedure calls and file accesses.
The data structure used to communicate basic function
information and buffer allocation between the user code and
runtime is of type scheduler_args_t. Its fields are sum-
marized in Table 2. The basic fields provide pointers to in-
put/output data buffers and to the user-provided functions.
They must be properly set by the programmer before call-
ing phoenix_scheduler(). The remaining fields are
optionally used by the programmer to control scheduling
decisions by the runtime. We discuss these decisions further
in Section 3.2.4. There are additional data structure types to
facilitate communication between the Splitter, Map, Parti-
tion, and Reduce functions. These types use pointers when-
ever possible to implement communication without actually
copying significant amounts of data.
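As an illustration, the following sketch shows how the word count example of Section 2.1 might be expressed with this API. The function signatures follow Table 1, but the scheduler_args_t field names are paraphrased from Table 2 and the input-loading and Splitter helpers are assumed, so the identifiers here are illustrative rather than the exact ones in the Phoenix headers.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
/* #include "phoenix.h"  -- assumed header providing the API of Table 1 */

/* Map: emit (word, 1) for every word in this task's input chunk. */
void wordcount_map(map_args_t *args) {
    char *save = NULL;
    char *text = (char *)args->data;               /* assumed field: this task's chunk */
    for (char *w = strtok_r(text, " \t\n", &save); w != NULL;
         w = strtok_r(NULL, " \t\n", &save))
        emit_intermediate(w, (void *)(intptr_t)1, (int)strlen(w) + 1);
}

/* Reduce: sum the counts collected for one key. */
void wordcount_reduce(void *key, void **vals, int num_vals) {
    intptr_t count = 0;
    for (int i = 0; i < num_vals; i++)
        count += (intptr_t)vals[i];                /* each value is a cast integer */
    emit(key, (void *)count);
}

int wordcount_keycmp(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

int main(void) {
    scheduler_args_t args = {0};
    args.input_data  = load_documents();            /* hypothetical helper */
    args.data_size   = documents_length();          /* hypothetical helper; input size in bytes */
    args.output_data = malloc(OUTPUT_BUFFER_BYTES); /* buffer space allocated by the user */
    args.splitter    = wordcount_splitter;          /* hypothetical splitter_t, see Table 1 */
    args.map         = wordcount_map;
    args.reduce      = wordcount_reduce;
    args.key_cmp     = wordcount_keycmp;
    return phoenix_scheduler(&args);                /* initializes the runtime and runs the job */
}

The optional fields of Table 2 (for example, Unit size or the maximum number of Map workers) can be set on the same structure to guide the scheduling decisions discussed in Section 3.2.4.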
The API guarantees that within a partition of the interme-
diate output, the pairs will be processed in key order. This
makes it easier to produce a sorted final output which is of-
ten desired. There is no guarantee in the processing order of
the original input during the Map stage. These assumptions
did not cause any complications with the programs we ex-
amined. In general it is up to the programmer to verify that
the algorithm can be expressed with the Phoenix API given
these restrictions.
Functions Provided by Runtime
int phoenix_scheduler(scheduler_args_t *args)  [R]
    Initializes the runtime system. The scheduler_args_t struct provides the needed function and data pointers.
void emit_intermediate(void *key, void *val, int key_size)  [O]
    Used in Map to emit an intermediate output <key,value> pair. Required if the Reduce is defined.
void emit(void *key, void *val)  [O]
    Used in Reduce to emit a final output pair.
Functions Defined by User
int (*splitter_t)(void *, int, map_args_t *)  [R]
    Splits the input data across Map tasks. The arguments are the input data pointer, the unit size for each task, and the input buffer pointer for each Map task.
void (*map_t)(map_args_t *)  [R]
    The Map function. Each Map task executes this function on its input.
int (*partition_t)(int, void *, int)  [O]
    Partitions intermediate pairs for Reduce tasks based on their keys. The arguments are the number of Reduce tasks, a pointer to the keys, and the size of the key. Phoenix provides a default partitioning function based on key hashing.
void (*reduce_t)(void *, void **, int)  [O]
    The Reduce function. Each Reduce task executes this on its input. The arguments are a pointer to a key, a pointer to the associated values, and the value count. If not specified, Phoenix uses a default identity function.
int (*key_cmp_t)(const void *, const void *)  [R]
    Function that compares two keys.
Table 1. The functions in the Phoenix API. R and O identify required and optional functions respectively.
The Phoenix API does not rely on any specific compiler options and does not require a parallelizing compiler. However, it assumes that its functions can freely
use stack-allocated and heap-allocated structures for pri-
vate data. It also assumes that there is no communica-
tion through shared-memory structures other than the in-
put/output buffers for these functions. For C/C++, we can-
not check these assumptions statically for arbitrary pro-
grams. Although there are stringent checks within the sys-
tem to ensure valid data are communicated between user
and runtime code, eventually we trust the user to provide
functionally correct code. For Java and C#, static checks
that validate these assumptions are possible.
3.2 The Phoenix Runtime
The Phoenix runtime was developed on top of P-
threads [18], but can be easily ported to other shared-
memory thread packages.
3.2.1 Basic Operation and Control Flow
Figure 1 shows the basic data flow for the runtime system.
The runtime is controlled by the scheduler, which is initi-
ated by user code. The scheduler creates and manages the
threads that run all Map and Reduce tasks. It also manages
the buffers used for task communication. The programmer
provides the scheduler with all the required data and func-
tion pointers through the scheduler_args_t structure.
After initialization, the scheduler determines the number of
cores to use for this computation. For each core, it spawns
a worker thread that is dynamically assigned some number
of Map and Reduce tasks.
To start the Map stage, the scheduler uses the Splitter
to divide input pairs into equally sized units to be processed
by the Map tasks. The Splitter is called once per Map
task and returns a pointer to the data the Map task will pro-
cess. The Map tasks are allocated dynamically to work-
ers and each one emits intermediate <key,value> pairs.
The Partition function splits the intermediate pairs into
units for the Reduce tasks. The function ensures all values
of the same key go to the same unit. Within each buffer,
values are ordered by key to assist with the final sorting. At
this point, the Map stage is over. The scheduler must wait
for all Map tasks to complete before initiating the Reduce
stage.
Reduce tasks are also assigned to workers dynamically,
similar to Map tasks. The one difference is that, while with
Map tasks we have complete freedom in distributing pairs
across tasks, with Reduce we must process all values for the
same key in one task. Hence, the Reduce stage may exhibit
higher imbalance across workers and dynamic scheduling is
more important. The output of each Reduce task is already
sorted by key. As the last step, the final output from all tasks
is merged into a single buffer, sorted by keys. The merging
takes place in log2(P/2) steps, where P is the number of
workers used. While one can imagine cases where the out-
put pairs do not have to be ordered, our current implemen-
tation always sorts the final output as it is also the case in
Google’s implementation [8].
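The dynamic assignment of Map and Reduce tasks to workers can be as simple as each worker pulling the next unprocessed task from a shared counter, so that faster workers naturally take on more tasks. The sketch below shows this basic pattern on top of P-threads; it is illustrative only and not the actual Phoenix scheduler code.

#include <pthread.h>

/* Illustrative sketch of dynamic task assignment, not the Phoenix internals. */
typedef struct {
    pthread_mutex_t lock;
    int next_task;      /* index of the next unassigned task in this stage */
    int num_tasks;      /* total Map (or Reduce) tasks in this stage */
} task_queue_t;

/* Each worker calls this in a loop; returns -1 when the stage is finished. */
int take_next_task(task_queue_t *q) {
    int t = -1;
    pthread_mutex_lock(&q->lock);
    if (q->next_task < q->num_tasks)
        t = q->next_task++;
    pthread_mutex_unlock(&q->lock);
    return t;
}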

Field Description
Basic Fields
Input data Input data pointer; passed to the Splitter by the runtime
Data size Input dataset size
Output data Output data pointer; buffer space allocated by user
Splitter Pointer to Splitter function
Map Pointer to Map function
Reduce Pointer to Reduce function
Partition Pointer to Partition function
Key cmp Pointer to key compare function
Optional Fields for Performance Tuning
Unit size Pairs processed per Map/Reduce task
L1 cache size L1 data cache size in bytes
Num Map workers Maximum number of threads (workers) for Map tasks
Num Reduce workers Maximum number of threads (workers) for Reduce tasks
Num Merge workers Maximum number of threads (workers) for Merge tasks
Num procs Maximum number of processor cores used
Table 2. The scheduler_args_t data structure type.
3.2.2 Buffer Management
Two types of temporary buffers are necessary to store data
between the various stages. All buffers are allocated in
shared memory but are accessed in a well specified way by
a few functions. Whenever we have to re-arrange buffers
(e.g., split across tasks), we manipulate pointers instead of
the actual pairs, which may be large in size. The intermedi-
ate buffers are not directly visible to user code.
Map-Reduce buffers are used to store the intermediate
output pairs. Each worker has its own set of buffers. The
buffers are initially sized to a default value and then resized
dynamically as needed. At this stage, there may be multiple
pairs with the same key. To accelerate the Partition
function, the emit_intermediate function stores all
values for the same key in the same buffer. At the end of
the Map task, we sort each buffer by key order. Reduce-
Merge buffers are used to store the outputs of Reduce tasks
before they are sorted. At this stage, each key has only one
value associated with it. After sorting, the final output is
available in the user allocated Output data buffer.
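One possible shape for these per-worker intermediate buffers is sketched below. The types are illustrative assumptions rather than the actual Phoenix data structures, but they show the idea of keeping all values for a key together and growing the arrays on demand.

/* Illustrative only: not the actual Phoenix buffer layout. */
typedef struct {
    void  *key;             /* one distinct key seen by this worker */
    void **vals;            /* every value emitted for this key so far */
    int    num_vals;
    int    vals_capacity;   /* grown dynamically as pairs are emitted */
} keyed_values_t;

typedef struct {
    keyed_values_t *entries;    /* sorted by key at the end of each Map task */
    int num_entries;
    int entries_capacity;       /* buffers start at a default size and are resized */
} worker_buffer_t;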
3.2.3 Fault Recovery
The runtime provides support for fault tolerance for tran-
sient and permanent faults during Map and Reduce tasks. It
focuses mostly on recovery with some limited support for
fault detection.
Phoenix detects faults through timeouts. If a worker does
not complete a task within a reasonable amount of time,
then a failure is assumed. The execution time of similar
tasks on other workers is used as a yardstick for the timeout
interval. Of course, a fault may cause a task to complete
with incorrect or incomplete data instead of failing com-
pletely. Phoenix has no way of detecting this case on its own
and cannot stop an affected task from potentially corrupt-
ing the shared memory. To address this shortcoming, one
should combine the Phoenix runtime with known error de-
tection techniques [20, 21, 24]. Due to the functional nature
of the MapReduce model, Phoenix can actually provide in-
formation that simplifies error detection. For example, since
the address ranges for input and output buffers are known,
Phoenix can notify the hardware about which load/store ad-
dresses to shared structures should be considered safe for
each worker and which should signal a potential fault.
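A timeout check of this kind can be sketched as follows; the helper and the slack factor are assumptions for illustration, not the actual Phoenix heuristics.

/* Illustrative sketch: a task is suspected failed if it has run much longer
   than similar tasks that already completed on other workers. */
int task_suspected_failed(double elapsed_secs,
                          double mean_secs_of_similar_tasks,
                          double slack_factor /* e.g., 3.0; assumed value */) {
    if (mean_secs_of_similar_tasks <= 0.0)
        return 0;   /* no yardstick yet, so do not flag anything */
    return elapsed_secs > slack_factor * mean_secs_of_similar_tasks;
}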
Once a fault is detected or at least suspected, the runtime
attempts to re-execute the failed task. Since the original
task may still be running, separate output buffers are allo-
cated for the new task to avoid conflicts and data corruption.
When one of the two tasks completes successfully, the run-
time considers the task completed and merges its result with
the rest of the output data for this stage. The scheduler ini-
tially assumes that the fault was a transient one and assigns
the replicated task to the same worker. If the task fails a
few times or a worker exhibits a high frequency of failed
tasks overall, the scheduler assumes a permanent fault and
no further tasks are assigned to this worker.
The current Phoenix code does not provide fault recovery
for the scheduler itself. The scheduler runs only for a very
small fraction of the time and has a small memory footprint,
hence it is less likely to be affected by a transient error. On
the other hand, a fault in the scheduler has more serious im-
plications for the program correctness. We can use known
techniques such as redundant execution or checkpointing to
address this shortcoming.
Google’s MapReduce system uses a different approach

References
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters.
S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system.
P. Charles et al. X10: an object-oriented approach to non-uniform cluster computing.
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language.
R. E. Ladner and M. J. Fischer. Parallel prefix computation.