
Proceedings ArticleDOI

Performance Characterization of Spark Workloads on Shared NUMA Systems

01 Mar 2018, pp. 41-48

TL;DR: This paper explores how Spark-based workloads are impacted by the effects of NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes.
Abstract: As the adoption of Big Data technologies becomes the norm in an increasing number of scenarios, there is also a growing need to optimize them for modern processors. Spark has gained momentum over the last few years among companies looking for high performance solutions that can scale out across different cluster sizes. At the same time, modern processors can be connected to large amounts of physical memory, in the range of up to few terabytes. This opens an enormous range of opportunities for runtimes and applications that aim to improve their performance by leveraging low latencies and high bandwidth provided by RAM. The result is that there are several examples today of applications that have started pushing the in-memory computing paradigm to accelerate tasks. To deliver such a large physical memory capacity, hardware vendors have leveraged Non-Uniform Memory Architectures (NUMA). This paper explores how Spark-based workloads are impacted by the effects of NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes. We explore several workloads run on top of the IBM Power8 processor, and provide manual strategies that can leverage performance improvements up to 40% on Spark workloads when using smart processor-pinning and workload collocation strategies.

Summary (3 min read)

Introduction

  • This paper explores how Spark-based workloads are impacted by the effects of NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes.
  • To achieve a good performance for in-memory computing frameworks on a NUMA system, there is a need to understand the topology of the interconnect between processor sockets and memory banks.
  • This paper explores how Spark-based in-memory computing workloads are impacted by the effects of NUMA architecture, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to leverage memory-consumption patterns to smartly co-locate workloads in scale-up nodes.
  • Section IV introduces the evaluation methodology used for the experiments.

A. Apache Spark

  • Apache Spark [2] is a popular open-source platform for large-scale in-memory data processing developed at UC Berkeley.
  • Spark uses Resilient Distributed Datasets (RDDs) [3], which are immutable collections of objects spread across a cluster that hide the details of distribution and fault-tolerance for a large collection of items.
  • Next, it splits the DAG into stages that contain pipelined transformations with dependencies.
  • Spark maintains a pool of executor threads which are used to execute the tasks in parallel.
  • The connection between the processor and the memory is composed of 8 links, each offering 9.6 GB/s write and 19.2 GB/s read bandwidth.

IV. METHODOLOGY

  • This section describes how the study on the impact of NUMA topology on in-memory data analytics workloads has been performed, as well as the rationale behind the experiments evaluated in the following sections.

A. Workloads

  • The experiments presented in this paper are based on SparkBench [17], which is a benchmark suite developed by IBM and widely tested in Power8 systems.
  • From the range of available workloads provided by the benchmark, Support Vector Machines (SVM), PageRank, and Spark SQL RDDRelation have been selected for the evaluation.
  • These workloads are well-known in the literature, and combine different characteristics to cover a large range of possible configurations.
  • Dataset sizes for SVM, SQL, and PageRank are 47, 24, and 17 GB, and the numbers of partitions are 376, 192, and 136, respectively, for all experiments.

B. Experimental Setup

  • Since the goal of this paper is to evaluate the performance of Spark workloads on NUMA hardware, all the experiments are conducted in a single machine; the characteristics of the machine’s architecture are described in Section II-B.
  • All the other parameters and values used to configure Spark during the experiments described later in this paper are summarized in Table I. Hardware counters have been used to collect real-time information from experimental executions, using the perfmon2 [19] library.
  • Memory bandwidth is calculated based on the formula defined in [20].
  • For CPU usage, memory usage, and context switches, the vmstat tool has been used.
  • To collect information about NUMA memory accesses, the numastat tool is used.

V. EXPERIMENT 1: WORKLOAD CHARACTERIZATION

  • This experiment consists of a performance characterization of Spark workloads, changing the configuration parameters of Spark itself and observing the impact of different configurations in terms of completion time and resource consumption.
  • More specifically, this experiment analyzes the effect of the software configuration in the resource usage intensiveness and possible speedups for the workloads described in Section IV-A.
  • This is because SQL is more affected by thread locks and cache contention than SVM.
  • Based on this property the authors classify configurations in different groups; for example, “Within 10% of optimal” contains configurations for which completion time is very close to the best execution time observed for that particular workload.
  • This is because more executors spawn more threads to process the tasks.

VI. EXPERIMENT 2: BINDING TO NUMA NODES

  • Allocating more NUMA nodes to a workload has the potential to increase resources (such as memory bandwidth, CPU, and memory) and possibly lower cache contention (due to the availability of additional cores and cache), but it can also involve a trade-off: using remote memory accesses and dealing with bus contention, which can lead to slowdowns in some scenarios.
  • In the case of 2B, the optimal configurations are 8 cores per worker and 3 workers per node for SQL, 6 cores per worker and 6 workers per node for SVM, and 4 cores per worker and 6 workers per node for PageRank.
  • The results of this experiment, as summarized in Table VII, show a significant speedup when comparing manual binding versus the OS allocating the resources, but not for all workloads.
  • In the case of SVM, SQL and PageRank, the speedups are 1.15x, 1.07x, and 1.25x, respectively.
  • The results of this experiment also show how applications scale with more NUMA nodes.

VII. EXPERIMENT 3: WORKLOAD CO-SCHEDULING

  • This final experiment explores the benefits of workload colocation and process binding (cores and memory) as a mechanism to improve system throughput and increase resource utilization.
  • This experiment, therefore, evaluates the performance impact on workloads when sharing the same machine, that is, when workloads are co-located.
  • The authors repeat the process with the 1B-3B configurations (1B for 1 NUMA node with binding, and 3B for 3 NUMA nodes with binding), in which one workload gets assigned one NUMA node while the other gets allocated the other three nodes.
  • In all cases the authors executed all combinations of SQL-SVM, SQL-PageRank, SVM-PageRank.
  • In the case of remote memory access, there is an increase of 80-91.93% when the same experiments are executed without binding.

VIII. CONCLUSIONS

  • In-memory computing is becoming one of the most popular approaches for real-time big data processing as data sets grow and more memory capacity is made available to popular runtimes such as Spark.
  • To deliver large physical memory capacity, modern processors feature Non-Uniform Memory Architectures (NUMA).
  • Each socket can have multiple processors, each with its own memory.
  • Large sets of experiments were executed to evaluate several Spark workloads, and the results demonstrated that workload colocation is a smart strategy to improve resource utilization for memory-intensive workloads placed in modern NUMA processors.
  • Highly concurrent configurations produce undesired memory access patterns across NUMA nodes that push to the limit the existing memory bandwidth, making co-scheduling a good choice.


Performance Characterization of Spark Workloads
on Shared NUMA Systems
Shuja-ur-Rehman Baig, Marcelo Amaral, Jordà Polo, David Carrera
{shuja.baig, marcelo.amaral, jorda.polo, david.carrera}@bsc.es
Universitat Politècnica de Catalunya (UPC)
Barcelona Supercomputing Center (BSC)
University of the Punjab (PU)
Abstract—As the adoption of Big Data technologies becomes
the norm in an increasing number of scenarios, there is also a
growing need to optimize them for modern processors. Spark
has gained momentum over the last few years among companies
looking for high performance solutions that can scale out across
different cluster sizes. At the same time, modern processors
can be connected to large amounts of physical memory, in
the range of up to few terabytes. This opens an enormous
range of opportunities for runtimes and applications that aim
to improve their performance by leveraging low latencies and
high bandwidth provided by RAM. The result is that there are
several examples today of applications that have started pushing
the in-memory computing paradigm to accelerate tasks. To deliver
such a large physical memory capacity, hardware vendors have
leveraged Non-Uniform Memory Architectures (NUMA).
This paper explores how Spark-based workloads are impacted
by the effects of NUMA-placement decisions, how different Spark
configurations result in changes in delivered performance, how
the characteristics of the applications can be used to predict
workload collocation conflicts, and how to improve performance
by collocating workloads in scale-up nodes. We explore several
workloads run on top of the IBM Power8 processor, and provide
manual strategies that can leverage performance improvements
up to 40% on Spark workloads when using smart processor-
pinning and workload collocation strategies.
Index Terms—Performance, Modeling, Characterization,
Memory, NUMA, Spark, Benchmark
I. INTRODUCTION
Nowadays, due to the growth in the number of cores in
modern processors, parallel systems are built using Non-
Uniform Memory Architecture (NUMA), which has gained
wide acceptance in the industry, setting the new standard for
building new generation enterprise servers. These processors
can be connected to large amounts of physical memory, in the
range of up to a couple of terabytes for the time being. This
opens an enormous range of opportunities for runtimes and
applications that aim to improve their performance by leverag-
ing low latencies and high bandwidth provided by RAM. The
result is that today there are several examples of applications
that have started pushing the in-memory computing paradigm
to accelerate tasks.
To deliver such a large physical memory capacity, sockets
in NUMA systems are connected through high performance
connections, and each socket can have multiple processors,
each with its own memory. A process running on a NUMA system
can access the memory of its own node as well as that of remote
nodes, where the latency of memory accesses on remote nodes is
significantly higher than that of local memory accesses [1].
Ideally, memory accesses are kept local in order to avoid
this latency and contention on interconnect links. Moreover,
the bandwidth of memory accesses to last-level caches and
DRAM memory also depends on whether the access is
local or remote. From the NUMA perspective, the question
is whether NUMA topology can meet the challenges of
in-memory computing frameworks, and if not, what kinds of
optimizations are required.
At the same time, as the adoption of Big Data technologies
becomes the norm in many scenarios, there is a growing
need to optimize them for modern processors. Spark [2] has
gained momentum over the last few years among companies
looking for high performance solutions that can scale out
across different cluster sizes. To achieve a good performance
for in-memory computing frameworks on a NUMA system,
there is a need to understand the topology of the interconnect
between processor sockets and memory banks. Additionally,
while a NUMA architecture can provide very high memory
capacity to the applications, it also implies the additional
complexity of letting the Operating System take care of many
critical decisions with respect to where data is physically
stored and where are processes accessing that data placed. This
fact may have no impact for many applications that are not
memory intensive, whereas memory-bound applications can be
seriously impacted by these decisions in terms of performance.
This paper explores how Spark-based in-memory computing
workloads are impacted by the effects of NUMA architecture,
how different Spark configurations result in changes in deliv-
ered performance, how the characteristics of the applications
can be used to predict workload collocation conflicts, and
how to leverage memory-consumption patterns to smartly
co-locate workloads in scale-up nodes. The evaluation also
characterizes several workloads running on top of the IBM
Power8 processor, and provides strategies that can lead to
performance improvements of up to 40% on Spark workloads
when using smart processor-pinning and workload collocation
strategies.
In summary, the main contributions of this paper are the
following:
A characterization of three representative memory-
intensive Spark workloads across multiple software con-
figurations on top of a modern NUMA processor (IBM
Power 8). The study illustrates how these workloads re-
quire a high level of concurrency to achieve full processor
utilization when running in isolation, but at the same
time they are limited by the processor memory bandwidth
when such levels of concurrency are reached, resulting in
poor performance and resource underutilization.
An evaluation of how process and memory binding
to NUMA nodes can provide means to improve the
performance of memory-intensive Spark workloads by
reducing the competition among software threads on high
concurrency situations.
A proposal of smart workload co-location strategies that lever-
age process and memory binding to NUMA nodes as
a mechanism to improve performance and resource uti-
lization for modern NUMA processors running memory-
intensive workloads.
The rest of the paper is organized as follows. Sections II
and III provide technological background as well as set
the state of the art for the work presented in this paper.
Section IV introduces the evaluation methodology used for
the experiments. Sections V, VI and VII describe experiments
performed to support the conclusions presented in this work.
Finally, Section VIII presents the conclusions to the work.
II. BACKGROUND
A. Apache Spark
Apache Spark [2] is a popular open-source platform for
large-scale in-memory data processing developed at UC
Berkeley. Spark is designed to avoid the file system as much
as possible, retaining most data resident in distributed memory
across phases in the same job. Spark uses Resilient Distributed
Datasets (RDDs) [3], which are immutable collections of ob-
jects spread across a cluster that hide the details of distribution
and fault-tolerance for a large collection of items. Spark
provides a programming interface based on two high-
level operation types: i) transformations and ii) actions. Transformations
are lazy operations that create new RDDs. Actions launch a
computation on RDDs to return a value to the application or
write data to the distributed storage system. When a user runs
an action on an RDD, Spark first builds a directed acyclic
graph (DAG) of operations. Next, it splits the DAG into stages that
contain pipelined transformations with dependencies. Further,
it divides each stage into tasks. A task is a combination of
data and computation. Spark executes all the tasks of a stage
before moving to the next stage. Spark maintains a pool of executor
threads which are used to execute the tasks in parallel.
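The lazy transformation/action split described above can be illustrated with a toy pipeline in plain Python; `ToyRDD` below is a stand-in for the RDD API, not Spark's implementation, but it shows the same behavior: transformations only record a plan, and an action executes it.

```python
# Toy stand-in for Spark's lazy evaluation: transformations build a plan,
# actions execute it over the data.
class ToyRDD:
    def __init__(self, data, plan=()):
        self.data, self.plan = data, plan

    def map(self, f):      # transformation: lazy, returns a new "RDD"
        return ToyRDD(self.data, self.plan + (("map", f),))

    def filter(self, p):   # transformation: lazy, returns a new "RDD"
        return ToyRDD(self.data, self.plan + (("filter", p),))

    def collect(self):     # action: runs the whole recorded pipeline
        out = self.data
        for op, f in self.plan:
            out = list(map(f, out)) if op == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; collect() triggers execution:
print(rdd.collect())  # → [0, 4, 16, 36, 64]
```

In real Spark the plan is the DAG described above, which the scheduler cuts into stages of pipelined transformations before dispatching tasks to executor threads.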
B. IBM Power 8
We run all the experiments on an IBM Power System 8247-
42L, which is a 2-way 12-core IBM Power8 machine with all
cores at 3.02 GHz and each core able to run up to 8
Simultaneous Multi-Threading (SMT) threads [4].
Each Power8 processor is composed of two memory regions
(i.e. NUMA nodes), each with 6 cores, its own memory
controller, and 256 GB of RAM. The Power8 processor includes
four cache levels: a store-through L1 data cache, a store-in
L2 cache, and an eDRAM-based L3 cache, with per-core
capacities of 64 KB, 512 KB, and 8 MB, respectively. The
fourth cache level has 128 MB and consists of eight external
memory chips called Centaur (which use DDR3 memory).
[Figure 1 depicts the Power8 NUMA architecture: four NUMA
nodes of 6 cores and 256 GB of RAM each; per-socket memory
bandwidth of 230.4 GB/s (8 × 19.2 GB/s = 153.6 GB/s read;
8 × 9.6 GB/s = 76.8 GB/s write).]
Fig. 1: Power8 NUMA architecture
Because the main memory is connected to the processor
using separate links for read and write operations, with
two links for memory reads and one link for memory writes,
the system has asymmetric read and write bandwidth. The
connection between the processor and the memory is com-
posed of 8 links, each offering 9.6 GB/s write and 19.2
GB/s read bandwidth. Therefore, the system has in total four
NUMA nodes, 192 virtual cores, 1 TB of RAM and a total of
230.4 GB/s of sustainable memory bandwidth per socket, as
illustrated in Figure 1. For the software stack, the machine is
configured with a Ubuntu 16.10, kernel 4.4.0-47-generic, IBM
java version 1.8.0 and Apache Spark 1.6.1.
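The per-socket bandwidth figure quoted above follows directly from the per-link rates, as a quick arithmetic check confirms:

```python
# Per-socket memory bandwidth of the Power8 machine, from per-link rates.
LINKS = 8
READ_PER_LINK = 19.2   # GB/s read per link
WRITE_PER_LINK = 9.6   # GB/s write per link

read_bw = LINKS * READ_PER_LINK    # aggregate read bandwidth: 153.6 GB/s
write_bw = LINKS * WRITE_PER_LINK  # aggregate write bandwidth: 76.8 GB/s
total_per_socket = round(read_bw + write_bw, 1)

print(read_bw, write_bw, total_per_socket)  # → 153.6 76.8 230.4
```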
III. RELATED WORK
Although there have been several research efforts to in-
vestigate and mitigate the impact of NUMA on workload
performance, this topic is still gaining traction in the literature
in recent years [5], [6], [7], [8], [9], [10], [11], [12], [13]. These
works characterize some key sources of workload performance
degradation related to NUMA (such as additional latencies
and possibly lower memory bandwidth because of remote
memory access and contentions on the memory controller,
bus connections or cache contention), and propose OS task
placement strategies for mitigating remote memory access.
But the characterization of Spark workloads on IBM Power8
systems, and placement strategies for co-scheduled applications,
are still poorly understood.
For example, [6] has characterized the NUMA performance
impact related to remote memory access introduced by the OS
when performing task re-scheduling or load balancing. While
this work proposed an effective approach to mitigating remote
memory access and cache contention, it is not application-
driven and does not have a holistic view of all applications to
define efficient workloads co-scheduling on NUMA systems.
In our work, we demonstrate the potential benefits from man-
ual binding strategies when co-scheduling multiple workloads
on NUMA systems.

Another example is the work of [14], where the authors
characterized the performance impact of NUMA on graph-
analytics applications. They present an approach to minimize
remote memory access by using graph-aware data allocation
and access strategies. While this work presents an application-
driven investigation, it lacks an analysis of memory-intensive
Spark workloads and workload collocation.
The works most closely related to ours are [15] and
[16]. In the former, the authors quantify the impact of data
locality on NUMA nodes for Spark workloads on an Intel Ivy
Bridge server. In the latter, the authors evaluate the impact
of NUMA locality on the performance of in-memory data
analytics with Spark on an Intel Ivy Bridge server. In both
papers, they run benchmarks with two configurations: a) Local
DRAM and b) Remote DRAM. In Local DRAM, they bound the Spark
process to processor 0 and memory node 0, and in Remote
DRAM, they bound the Spark process to processor 0 and
memory node 1. Then, they compare the results to evaluate the
performance impact of NUMA. While those works present a
detailed performance characterization of Spark workloads on
NUMA systems on an Intel Ivy Bridge server, the NUMA per-
formance characterization of IBM Power8 systems is still not
well understood. Moreover, their work does not present the NUMA
impact of optimally binding the workloads versus leaving the
OS allocating the resources. Also, they do not evaluate the
performance benefits of performing manual binding for co-
scheduled Spark workloads as we present in this paper.
IV. METHODOLOGY
This section describes how the study on the impact of
NUMA topology on in-memory data analytics workloads has
been performed, as well as the rationale behind the experi-
ments evaluated in the following sections.
A. Workloads
The experiments presented in this paper are based on Spark-
Bench [17], which is a benchmark suite developed by IBM and
widely tested in Power8 systems. From the range of available
workloads provided by the benchmark, Support Vector Ma-
chines (SVM), PageRank, and Spark SQL RDDRelation have
been selected for the evaluation. These workloads are well-
known in the literature, and combine different characteristics
to cover a large range of possible configurations. Dataset size
for SVM, SQL and PageRank is 47, 24, and 17 GB and
number of partitions are 376, 192 and 136 respectively for
all experiments.
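A quick check on these numbers shows the datasets were partitioned uniformly: every workload works out to 128 MB per partition (the common HDFS/Spark block size; this uniformity is our observation, not a claim made by the authors):

```python
# Dataset sizes (GB) and partition counts per workload, from Section IV-A.
datasets = {"SVM": (47, 376), "SQL": (24, 192), "PageRank": (17, 136)}

for name, (size_gb, partitions) in datasets.items():
    mb = size_gb * 1024 / partitions  # MB per partition
    print(f"{name}: {mb:.0f} MB per partition")  # 128 MB for every workload
```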
B. Experimental Setup
Since the goal of this paper is to evaluate the perfor-
mance of Spark workloads on NUMA hardware, all the
experiments are conducted in a single machine; the charac-
teristics of the machine’s architecture are described in Sec-
tion II-B. For simplicity, Spark is configured in standalone
mode [18]. To control the number of cores, memory, and
the number of executors of each worker, the parameters
SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, and
parameter: value
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.rdd.compress: FALSE
spark.io.compression.codec: lzf
storage level: MEMORY_AND_DISK
spark.driver.maxResultSize: 2 GB
spark.driver.memory: 5 GB
spark.kryoserializer.buffer.max: 512m
TABLE I: Spark configuration parameters
SPARK_EXECUTOR_MEMORY [18] are used, respectively. All
the other parameters and values used to configure Spark,
during the experiments execution described later in this paper,
are summarized in Table I.
Hardware counters have been used to collect most real-
time information from experimental executions, using the
perfmon2 [19] library. Memory bandwidth is calculated based
on the formula defined in [20]. For CPU usage, memory usage,
and context switches, the vmstat tool has been used. To collect
information about NUMA memory accesses, the numastat tool is
used. Finally, the numactl command has been used to bind
a workload to a set of CPUs or memory regions (e.g. to
a NUMA node); the nB nomenclature is used to describe
different binding configurations, where n is the number of
assigned NUMA nodes, e.g. 1B for workloads bound to 1
NUMA node, 2B for workloads bound to 2 NUMA nodes,
etc.
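As a sketch, the nB bindings can be expressed as numactl prefixes on the worker launch command. The helper below, the launch script path, and the master URL are our illustration, not taken from the paper; only the numactl flags are real:

```python
def bind_cmd(nodes, cmd):
    """Prefix `cmd` with numactl so its CPU placement and memory allocations
    are restricted to the given NUMA nodes (the paper's nB configurations)."""
    node_list = ",".join(str(n) for n in nodes)
    return f"numactl --cpunodebind={node_list} --membind={node_list} {cmd}"

# 1B: bound to one NUMA node; 2B: bound to two nodes (placeholder command).
print(bind_cmd([0], "./sbin/start-slave.sh spark://master:7077"))
print(bind_cmd([0, 1], "./sbin/start-slave.sh spark://master:7077"))
```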
V. EXPERIMENT 1: WORKLOAD CHARACTERIZATION
This experiment consists of a performance characterization
of Spark workloads, changing the configuration parameters of
Spark itself and observing the impact of different configura-
tions in terms of completion time and resource consumption.
The goal is to find which configurations lead to optimal
performance, e.g. which number of cores per Spark worker,
and/or the number of workers per application. Since this
experiment aims at defining the software configuration, there
is no other kind of hardware tuning involved; the OS performs
its default resource allocation decisions. More specifically, this
experiment analyzes the effect of the software configuration in
the resource usage intensiveness and possible speedups for the
workloads described in Section IV-A.
As described in Section II-B, the machine used for the
experiments has 4 NUMA nodes, 192 virtual cores, and 1TB
of main memory. Thus, this experiment varies the number
of cores, workers, and memory up to the total amount of
resources (cores and memory) available in this machine. Table
II describes all combinations of the amount of resources
allocated to all workloads in this experiment. The amount of
memory ranges from 5 up to 250 GB per worker, and the number
of workers varies from 4 up to 192. Depending on the
number of workers, the number of virtual cores ranges from 1
up to 192. If the total number of workers is 192, each one will
have only one virtual core. Note that, by creating this matrix
of experiments, we want to see at which configuration, the
operating system produce optimal results for each workload
type in terms of completion time. We assign memory to Spark

                     cores per worker (cells: total cores allocated)
wem   tma   tsw |   1    2    3    4    6    8   12   16   24   48
250  1000     4 |   4    8   12   16   24   32   48   64   96  192
125  1000     8 |   8   16   24   32   48   64   96  128  192
 83   996    12 |  12   24   36   48   72   96  144  192
 62   992    16 |  16   32   48   64   96  128  192
 41   984    24 |  24   48   72   96  144  192
 31   992    32 |  32   64   96  128  192
 20   960    48 |  48   96  144  192
 15   960    64 |  64  128  192
 10   960    96 |  96  192
  5   960   192 | 192
TABLE II: Experiment 1: Evaluated software configurations
(wem is worker and executor memory in GB; tma is total memory
allocated in GB; tsw is total Spark workers)
workload   total    cores per  total cores  worker/executor    total memory     execution
           workers  worker     allocated    memory (GB)        allocated (GB)   time (sec)
SVM        24       8          192          41                 984              323.71
SQL        4        12         48           250                1000             206.82
PageRank   12       8          96           83                 996              748.08
TABLE III: Experiment 1: Best configuration when optimizing
for completion time
worker and executor by dividing 1000 GB by the total number of
workers in the experiment and taking the integer part only; thus,
the amount of memory ranges from 960 GB to 1000 GB. Some
amount of memory is intentionally left for the OS and other
processes (e.g. Spark driver and master) to avoid slowdown
effects not related to NUMA.
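The memory-per-worker column of Table II follows directly from this rule, as can be reproduced in a couple of lines:

```python
# Worker/executor memory (GB): integer part of 1000 GB split across workers,
# reproducing the wem column of Table II.
TOTAL_GB = 1000
worker_counts = [4, 8, 12, 16, 24, 32, 48, 64, 96, 192]

wem = {w: TOTAL_GB // w for w in worker_counts}
print(wem)
# → {4: 250, 8: 125, 12: 83, 16: 62, 24: 41, 32: 31, 48: 20, 64: 15, 96: 10, 192: 5}
```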
In Spark, a software configuration defines the number of
workers, the number of virtual cores, and the amount of memory
per worker that is assigned to a specific workload. These
software resources need to match the hardware configuration
of the node used to run the workloads. But not all applications
can take advantage of an increasing amount of resources and
therefore it is not always the case that one single software
configuration optimizes the performance of a Spark Workload
for a given hardware setup.
Table III summarizes the optimal configurations found
for the three workloads considered in this experiment. As
can be seen, every workload achieves maximum performance
using a different software configuration, with SVM being the
application that can take advantage of more threads in parallel,
followed by PageRank and finally SQL. It is remarkable that
even configurations with a similar total number of allocated cores
tend to deliver different performance depending on the particular
combination of number of workers and cores per worker. For
instance, SQL works best with fewer workers and more cores
per worker, and SVM gets the best performance when more
workers and fewer cores per worker are assigned. This is
because SQL is more affected by thread locks and cache
contention than SVM. Hence, SQL benefits from fewer
threads competing for resources. Additionally, because the
JVM includes additional overheads (e.g. garbage collection),
more layers for resource management and memory indexing, it
is not beneficial to have several workers with only one virtual
core.
To explain the root cause of the performance delivered
by the different configurations, Tables IV, V and VI also
show the execution times in seconds obtained for SVM,
SQL, and PageRank respectively when using all combinations
of software configurations, but in this case, we color each
configuration according to the relative performance delivered
compared to the optimal configuration found for that particular
workload. Based on this property we classify configurations
in different groups:
Within 10% of optimal: configurations for which comple-
tion time is very close to the best execution time observed
for that particular workload.
Low CPU Usage: configurations for which CPU usage is
clearly below the observed CPU usage for the workload.
These configurations use a too low number of cores or
workers that are not enough to fully utilize the available
compute resources and produce optimal results.
High CPI and Context Switches: executions where cycles
per instruction (CPI) and context switches are greater than
the observed values for the optimal configuration. This is
due to more executors spawning more threads to process
the tasks. Moreover, executors need to communicate with
each other and with the driver, which increases communication
overhead. Remote memory access also impacts the CPI, since
remote accesses require more CPU cycles than local ones. As
a result, we see an increase in context switches and CPI.
High L3 misses: configurations where L3 cache misses
are greater than in the optimal configuration. This group
is only defined for SVM as it is the only workload for
which this behavior was observed.
Low Memory Bandwidth: configurations where memory
bandwidth usage is less than the observed value for the
optimal configuration.
Require more investigation: configurations where values
of metrics are within the range of optimal region but
the completion time is outside of 10%. The experiments
in this region require further investigation, as the reason for
the performance differences with the optimal configuration
could not be determined so far.
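The grouping above can be sketched as an ordered rule chain; the field names, thresholds, and rule order below are our illustration of the classification, not the authors' code:

```python
def classify(cfg, best):
    """Assign a configuration to one of the Section V groups by comparing
    its metrics against the optimal configuration `best`. Illustrative only."""
    if cfg["time"] <= 1.10 * best["time"]:
        return "within 10% of optimal"
    if cfg["cpu"] < best["cpu"]:
        return "low CPU usage"
    if cfg["cpi"] > best["cpi"] and cfg["ctx"] > best["ctx"]:
        return "high CPI and context switches"
    if cfg["l3_miss"] > best["l3_miss"]:
        return "high L3 misses"
    if cfg["mem_bw"] < best["mem_bw"]:
        return "low memory bandwidth"
    return "require more investigation"

# Hypothetical metrics for the optimal SVM run and a slow, underloaded run.
best = dict(time=323.71, cpu=80, cpi=1.0, ctx=100, l3_miss=10, mem_bw=100)
slow = dict(time=700.0, cpu=40, cpi=1.0, ctx=100, l3_miss=10, mem_bw=100)
print(classify(slow, best))  # → low CPU usage
```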
                          cores per worker
  w        1        2       3       4       6       8      12      16      24      48
  4  1437.95  1018.4  816.87  698.34  597.13   515.9  501.78  464.26  472.28   375.2
  8   759.16  555.26  478.23  411.91  366.39  357.54  347.49  359.18  360.48
 12    531.9   422.6  420.24  382.46  386.72  332.44  353.68  339.51
 16   458.58  412.85  385.79  352.69  347.22  333.63  336.26
 24    413.1  394.72   371.6  358.46  354.17  323.71
 32   405.16  389.16  369.81  361.36  341.53
 48    442.8  427.85  398.61  370.78
 64   546.11   522.3  569.83
 96  1118.77  922.01
192  1980.92
TABLE IV: SVM completion time (seconds) groups. In the
original, cells are color-coded into the groups: Within 10% of
optimal, Low CPU usage, High CPI and Context Switches,
High L3 misses, and Require more investigation.
The second objective of this experiment was to characterize
the CPU, Memory Footprint and Memory Bandwidth demands
of each one of the workloads of study. For this purpose, we
monitored the execution of the workloads when the optimal
software configuration was in use and plotted the average
resource consumption in Figure 2. As can be seen, memory
usage is 457.7 GB, 364.3 GB and 329.4 GB for SVM, SQL
and PageRank respectively; Figure 2 also shows the

                          cores per worker
  w        1       2       3       4       6       8      12      16      24      48
  4  1148.69  616.46  427.64  338.64  288.48  231.35  206.82  229.65  235.82  238.99
  8   590.22  335.51  264.95  255.86  220.93  209.33   259.7  236.06  220.27
 12   426.48  270.56  244.68   229.2  217.48  223.57   296.5  228.74
 16   375.65  288.92   242.9  247.26  241.71   245.1  240.67
 24   347.14  284.32  264.94   254.9   282.7   246.8
 32   347.65  293.96  268.68  287.77  262.32
 48   328.69  299.43  285.76  282.98
 64   324.99  307.44  307.54
 96   349.75  355.28
192   591.11
TABLE V: SQL completion time (seconds) groups. In the
original, cells are color-coded into the groups: Within 10% of
optimal, Low CPU usage, High CPI and Context Switches,
and Require more investigation.
w\cores       1        2        3        4       6        8       12       16       24       48
  4     4771.33  2028.77  1532.34  1188.68  1016.5  1000.84  2358.19  2207.09  1733.52  3186.17
  8     2517.32  1145.33   997.91   902.02  920.09   861.18   816.06  1148.35  1319.35
 12     1580.25   980.71    911.1   785.84  788.52   748.08  1028.29  1010.76
 16     1379.76   921.87   871.61   862.69   766.9   925.91    877.9
 24     1175.22   909.71   866.08    843.5  780.68   812.44
 32     1085.66   875.52      907  1095.11  880.61
 48      996.54   858.22   760.27   767.63
 64     1183.04  1143.43   912.03
 96     1447.05  1142.42
192     2918.32
TABLE VI: PageRank completion time (seconds) groups. Rows give the number of workers (w) and columns the cores per worker. In the original, cells are color-coded into the groups Within 10% of optimal, Low CPU usage, High CPI and Context Switches, Low Memory Bandwidth, and Require more investigation; the shading cannot be reproduced here.
Fig. 2: Experiment 1: CPU Usage (percentage) and Memory
Bandwidth (GB/s) for optimal configuration
average usage of user CPU time and memory bandwidth for these workloads. As can be observed, SVM is constrained by high CPU usage: user CPU time alone reaches around 80%, and adding the system and wait CPU times brings total CPU usage to about 100%, which is the actual performance bottleneck.
SQL is a more interesting case because both CPU and memory bandwidth usage are low for the fastest configuration, and no other resource is apparently acting as a bottleneck. In practice, what prevents total CPU usage from going higher is that the number of threads spawned in this configuration (only 48) is well below the number of hardware threads offered by the system. Therefore, only a fraction of the hardware threads (48 of 192) are in use, which is why the reported average CPU utilization is low: several hardware threads are idle. Intuition suggests that increasing the number of threads would improve performance, but the logs of the other experiments show that as soon as the thread count rises, memory bandwidth usage increases dramatically and quickly becomes the bottleneck at many stages of the execution. The bottom line is that the OS is not able to manage the threads of this workload correctly, creating memory access patterns that saturate the memory links of the P8 processor.
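The thread-count argument above can be made concrete: the total number of software threads is simply workers × cores per worker, to be compared against the 192 hardware threads of the evaluated machine. A small illustration (the 192-thread figure comes from the text above; the rest is plain arithmetic):

```python
# Sketch: total Spark software threads per configuration vs. hardware threads.
HW_THREADS = 192  # hardware threads offered by the evaluated POWER8 system

def total_threads(workers: int, cores_per_worker: int) -> int:
    return workers * cores_per_worker

# Optimal SQL configuration from Table V: 4 workers x 12 cores = 48 threads,
# i.e. only a fraction of the 192 hardware threads are in use.
sql_threads = total_threads(4, 12)
utilization = sql_threads / HW_THREADS
print(sql_threads, f"{utilization:.0%}")  # 48 25%
```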
Finally, PageRank is in the same situation: the optimal configuration involves only 96 software threads, while the system offers 192 hardware threads, so the reported CPU utilization is low. Intuition again points toward increasing the number of software threads, but when that direction is taken, logs show that the additional software threads start competing for memory bandwidth, because they exhibit worse memory access patterns, and saturate the memory links.
In summary, this experiment has shown two cases in which not all hardware threads could be exploited because the resulting memory access patterns across NUMA nodes hit a memory bandwidth bottleneck. This is an interesting result because it opens the door to smart workload collocation strategies, which are explored in the following experiments.
VI. EXPERIMENT 2: BINDING TO NUMA NODES
Allocating more NUMA nodes to a workload has the
potential to increase resources (such as memory bandwidth,
CPU, and memory) and possibly lower cache contention (due
to the availability of additional cores and cache), but it can also
involve a trade-off: using remote memory accesses and dealing
with bus contention, which can lead to slowdowns in some
scenarios. Binding, in this context, means Spark processes
(master and workers) will only have access to the resources
(cores and memory) of a particular set of NUMA nodes.
While the previous experiment selected the optimal software
configuration without binding, allowing the operating system
to make all decisions, this experiment selects the optimal
software configuration when binding all 4 nodes (4B). Results
are also compared to the previous experiment so as to evaluate
the impact of binding.
While in the previous experiment the OS was responsible for allocating all the resources, in this experiment we enforce decisions by manually binding the workloads across the NUMA nodes. The main motivation for manual binding is to mitigate the limitations identified in the previous experiments: manually binding the workloads can achieve better load balancing and minimize thread migrations and remote memory accesses. To verify these assumptions, we evaluate the completion time of all workloads under different binding configurations and compare it with the non-binding approach (the default allocation from the OS). As explained in Section VI, the workloads are bound to between one and four NUMA nodes (1B, 2B, 3B, 4B). The results of the optimal software configuration for the different numbers of NUMA nodes are shown in Table VII, which also compares manual binding with four NUMA nodes against the default OS resource allocation, labeled as NB.
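On Linux, this kind of binding is typically applied with numactl, which restricts a process's CPUs and memory allocations to a given set of NUMA nodes. The sketch below shows how a launch command could be wrapped; the start-worker.sh command is a placeholder, as the paper does not specify its exact launch scripts.

```python
# Sketch: wrap a command with numactl so it only uses a given set of NUMA nodes.
# The Spark launch command below is a placeholder, not the paper's actual script.

def numa_bound_command(cmd: list[str], nodes: list[int]) -> list[str]:
    node_list = ",".join(str(n) for n in nodes)
    return ["numactl",
            f"--cpunodebind={node_list}",  # restrict CPUs to these nodes
            f"--membind={node_list}",      # allocate memory only on these nodes
            ] + cmd

# 2B configuration: bind a worker to NUMA nodes 0 and 1.
print(numa_bound_command(["start-worker.sh"], [0, 1]))
# ['numactl', '--cpunodebind=0,1', '--membind=0,1', 'start-worker.sh']
```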
The optimal configurations are 24 cores per worker and 1 worker per node for SQL, SVM, and PageRank when we bind the workloads to one NUMA node (1B). In the case of 2B, the optimal configurations are 8 cores per worker and 3 workers per node for SQL, 6 cores per worker and 6 workers per node for SVM, and 4 cores per worker and 6 workers per node for PageRank. Similarly, in the case of 3B, the optimal configurations are 8 cores per worker and 2 workers per node for SQL, 6 cores
