Performance Characterization of Spark Workloads on Shared NUMA Systems

TL;DR: This paper explores how Spark-based workloads are impacted by the effects of NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes.
Abstract: As the adoption of Big Data technologies becomes the norm in an increasing number of scenarios, there is also a growing need to optimize them for modern processors. Spark has gained momentum over the last few years among companies looking for high performance solutions that can scale out across different cluster sizes. At the same time, modern processors can be connected to large amounts of physical memory, in the range of up to a few terabytes. This opens an enormous range of opportunities for runtimes and applications that aim to improve their performance by leveraging low latencies and high bandwidth provided by RAM. The result is that there are several examples today of applications that have started pushing the in-memory computing paradigm to accelerate tasks. To deliver such a large physical memory capacity, hardware vendors have leveraged Non-Uniform Memory Architectures (NUMA). This paper explores how Spark-based workloads are impacted by the effects of NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes. We explore several workloads run on top of the IBM Power8 processor, and provide manual strategies that can deliver performance improvements of up to 40% on Spark workloads when using smart processor-pinning and workload collocation strategies.

Summary (3 min read)

Introduction

  • This paper explores how Spark-based workloads are impacted by the effects of NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes.
  • To achieve a good performance for in-memory computing frameworks on a NUMA system, there is a need to understand the topology of the interconnect between processor sockets and memory banks.
  • This paper explores how Spark-based in-memory computing workloads are impacted by the effects of NUMA architecture, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to leverage memory-consumption patterns to smartly co-locate workloads in scale-up nodes.
  • Section IV introduces the evaluation methodology used for the experiments.

A. Workloads

  • The experiments presented in this paper are based on SparkBench [17], which is a benchmark suite developed by IBM and widely tested in Power8 systems.
  • From the range of available workloads provided by the benchmark, Support Vector Machines (SVM), PageRank, and Spark SQL RDDRelation have been selected for the evaluation.
  • These workloads are well-known in the literature, and combine different characteristics to cover a large range of possible configurations.

B. Experimental Setup

  • Since the goal of this paper is to evaluate the performance of Spark workloads on NUMA hardware, all the experiments are conducted in a single machine; the characteristics of the machine’s architecture are described in Section II-B.
  • All the other parameters and values used to configure Spark during the experiments described later in this paper are summarized in Table I. Hardware counters have been used to collect most real-time information from experimental executions, using the perfmon2 [19] library.
  • Memory bandwidth is calculated based on the formula defined in [20].
  • For CPU usage, memory usage, and context switches, the vmstat tool has been used.
  • To collect information about NUMA memory accesses, the numastat tool is used.

V. EXPERIMENT 1: WORKLOAD CHARACTERIZATION

  • This experiment consists of a performance characterization of Spark workloads, changing the configuration parameters of Spark itself and observing the impact of different configurations in terms of completion time and resource consumption.
  • More specifically, this experiment analyzes the effect of the software configuration on resource usage intensity and the possible speedups for the workloads described in Section IV-A.
  • This is because SQL is more affected by thread locks and cache contention than SVM.
  • Based on this property, the authors classify configurations into different groups; for example, "Within 10% of optimal" contains configurations for which completion time is very close to the best execution time observed for that particular workload.
  • This is because more executors spawn more threads to process the tasks.

VI. EXPERIMENT 2: BINDING TO NUMA NODES

  • Allocating more NUMA nodes to a workload has the potential to increase resources (such as memory bandwidth, CPU, and memory) and possibly lower cache contention (due to the availability of additional cores and cache), but it can also involve a trade-off: using remote memory accesses and dealing with bus contention, which can lead to slowdowns in some scenarios.
  • In the case of 2B, the optimal configurations are 8 cores per worker and 3 workers per node for SQL, 6 cores per worker and 6 workers per node for SVM, and 4 cores per worker and 6 workers per node for PageRank.
  • The results of this experiment, as summarized in Table VII, show a significant speedup when comparing manual binding versus the OS allocating the resources, but not for all workloads.
  • The results of this experiment also show how applications scale with more NUMA nodes.

VII. EXPERIMENT 3: WORKLOAD CO-SCHEDULING

  • This final experiment explores the benefits of workload colocation and process binding (cores and memory) as a mechanism to improve system throughput and increase resource utilization.
  • This experiment, therefore, evaluates the performance impact on workloads when sharing the same machine, that is when workloads are co-located.
  • The authors repeat the process with the 1B-3B configurations (1B for 1 NUMA node with binding, and 3B for 3 NUMA nodes with binding), in which one workload gets assigned one NUMA node while the other gets allocated the other three nodes.
  • In all cases the authors executed all combinations of SQL-SVM, SQL-PageRank, SVM-PageRank.
  • Remote memory accesses increase by 80-91.93% when the same experiments are executed without binding.

VIII. CONCLUSIONS

  • In-memory computing is becoming one of the most popular approaches for real-time big data processing as data sets grow and more memory capacity is made available to popular runtimes such as Spark.
  • To deliver large physical memory capacity, modern processors feature Non-Uniform Memory Architectures (NUMA).
  • Each socket can contain multiple processors, each with its own memory.
  • Large sets of experiments were executed to evaluate several Spark workloads, and the results demonstrated that workload colocation is a smart strategy to improve resource utilization for memory-intensive workloads placed in modern NUMA processors.
  • Highly concurrent configurations produce undesired memory access patterns across NUMA nodes that push the available memory bandwidth to its limit, making co-scheduling a good choice.


Performance Characterization of Spark Workloads
on Shared NUMA Systems
Shuja-ur-Rehman Baig, Marcelo Amaral, Jordà Polo, David Carrera
{shuja.baig, marcelo.amaral, jorda.polo, david.carrera}@bsc.es
Universitat Politècnica de Catalunya (UPC)
Barcelona Supercomputing Center (BSC)
University of the Punjab (PU)
Abstract—As the adoption of Big Data technologies becomes
the norm in an increasing number of scenarios, there is also a
growing need to optimize them for modern processors. Spark
has gained momentum over the last few years among companies
looking for high performance solutions that can scale out across
different cluster sizes. At the same time, modern processors
can be connected to large amounts of physical memory, in
the range of up to a few terabytes. This opens an enormous
range of opportunities for runtimes and applications that aim
to improve their performance by leveraging low latencies and
high bandwidth provided by RAM. The result is that there are
several examples today of applications that have started pushing
the in-memory computing paradigm to accelerate tasks. To deliver
such a large physical memory capacity, hardware vendors have
leveraged Non-Uniform Memory Architectures (NUMA).
This paper explores how Spark-based workloads are impacted
by the effects of NUMA-placement decisions, how different Spark
configurations result in changes in delivered performance, how
the characteristics of the applications can be used to predict
workload collocation conflicts, and how to improve performance
by collocating workloads in scale-up nodes. We explore several
workloads run on top of the IBM Power8 processor, and provide
manual strategies that can deliver performance improvements
of up to 40% on Spark workloads when using smart processor-
pinning and workload collocation strategies.
Index Terms—Performance, Modeling, Characterization,
Memory, NUMA, Spark, Benchmark
I. INTRODUCTION
Nowadays, due to the growth in the number of cores in
modern processors, parallel systems are built using Non-
Uniform Memory Architecture (NUMA), which has gained
wide acceptance in the industry, setting the new standard for
building new generation enterprise servers. These processors
can be connected to large amounts of physical memory, in the
range of up to a couple of terabytes for the time being. This
opens an enormous range of opportunities for runtimes and
applications that aim to improve their performance by leverag-
ing low latencies and high bandwidth provided by RAM. The
result is that today there are several examples of applications
that have started pushing the in-memory computing paradigm
to accelerate tasks.
To deliver such a large physical memory capacity, sockets
in NUMA systems are connected through high performance
connections and each socket can contain multiple processors,
each with its own memory. A process running on a NUMA system
can access the memory of its own node as well as that of remote
nodes, where the latency of memory accesses is significantly
higher than for local memory accesses [1].
Ideally, memory accesses are kept local in order to avoid
this latency and contention on interconnect links. Moreover,
the bandwidth of memory accesses to last-level caches and
DRAM also depends on whether the access is local or remote.
From the NUMA perspective, the question is whether the NUMA
topology can meet the challenges of in-memory computing
frameworks, and if not, what kinds of optimizations are required.
At the same time, as the adoption of Big Data technologies
becomes the norm in many scenarios, there is a growing
need to optimize them for modern processors. Spark [2] has
gained momentum over the last few years among companies
looking for high performance solutions that can scale out
across different cluster sizes. To achieve a good performance
for in-memory computing frameworks on a NUMA system,
there is a need to understand the topology of the interconnect
between processor sockets and memory banks. Additionally,
while a NUMA architecture can provide very high memory
capacity to the applications, it also implies the additional
complexity of letting the Operating System take care of many
critical decisions with respect to where data is physically
stored and where the processes accessing that data are placed.
This fact may have no impact on many applications that are not
memory intensive, whereas memory-bound applications can be
seriously impacted by these decisions in terms of performance.
This paper explores how Spark-based in-memory computing
workloads are impacted by the effects of NUMA architecture,
how different Spark configurations result in changes in deliv-
ered performance, how the characteristics of the applications
can be used to predict workload collocation conflicts, and
how to leverage memory-consumption patterns to smartly
co-locate workloads in scale-up nodes. The evaluation also
characterizes several workloads running on top of the IBM
Power8 processor, and provides strategies that can lead to
performance improvements of up to 40% on Spark workloads
when using smart processor-pinning and workload collocation
strategies.
In summary, the main contributions of this paper are the
following:
• A characterization of three representative memory-intensive
Spark workloads across multiple software
configurations on top of a modern NUMA processor (IBM
Power8). The study illustrates how these workloads require
a high level of concurrency to achieve full processor
utilization when running in isolation, but at the same
time they are limited by the processor memory bandwidth
when such levels of concurrency are reached, resulting in
poor performance and resource underutilization.
• An evaluation of how process and memory binding
to NUMA nodes can provide means to improve the
performance of memory-intensive Spark workloads by
reducing the competition among software threads in
high-concurrency situations.
• A proposal of smart workload co-location strategies that lever-
age process and memory binding to NUMA nodes as
a mechanism to improve performance and resource uti-
lization for modern NUMA processors running memory-
intensive workloads.
The rest of the paper is organized as follows. Sections II
and III provide technological background as well as set
the state of the art for the work presented in this paper.
Section IV introduces the evaluation methodology used for
the experiments. Sections V, VI and VII describe experiments
performed to support the conclusions presented in this work.
Finally, Section VIII presents the conclusions to the work.
II. BACKGROUND
A. Apache Spark
Apache Spark [2] is a popular open-source platform for
large-scale in-memory data processing developed at UC
Berkeley. Spark is designed to avoid the file system as much
as possible, retaining most data resident in distributed memory
across phases in the same job. Spark uses Resilient Distributed
Datasets (RDDs) [3] which are immutable collections of ob-
jects spread across a cluster and hides the details of distribution
and fault-tolerance for a larger collection of items. Spark
provides a programming interface based on two high-level
operations: i) transformations and ii) actions. Transformations
are lazy operations that create new RDDs. Actions launch a
computation on RDDs to return a value to the application or
write data to the distributed storage system. When a user runs
an action on an RDD, Spark first builds a directed acyclic
graph (DAG) of stages. Next, it splits the DAG into stages that
contain pipelined transformations with dependencies. Further,
it divides each stage into tasks. A task is a combination of
data and computation. Spark executes all the tasks of a stage
before going to the next stage. Spark maintains a pool of executor
threads which are used to execute the tasks in parallel.
B. IBM Power 8
We run all the experiments on an IBM Power System 8247-42L,
which is a 2-way 12-core IBM Power8 machine with all cores
at 3.02 GHz and with each core able to run up to 8 simultaneous
hardware threads (SMT) [4].
Each Power8 processor is composed of two memory regions
(i.e. NUMA nodes), each with 6 cores, its own memory
controller, and 256 GB of RAM. The Power8 processor includes
four cache levels, consisting of a store-through L1 data cache,
a store-in L2 cache, and an eDRAM-based L3 cache, with per-core
capacities of 64 KB, 512 KB, and 8 MB, respectively. The
fourth cache level has 128 MB and consists of eight external
memory chips called Centaur (which use DDR3 memory).
Fig. 1: Power8 NUMA architecture (four NUMA nodes, each with 6 cores and 256 GB of RAM; memory bandwidth per socket = 230.4 GB/s, with 8 links per socket: 8 * 19.2 = 153.6 GB/s read and 8 * 9.6 = 76.8 GB/s write)
Because the main memory is connected to the processor
using separate links for read and write operations, with
two links for memory reads and one link for memory writes,
the system has asymmetric read and write bandwidth. The
connection between the processor and the memory is com-
posed of 8 links, each offering 9.6 GB/s write and 19.2
GB/s read bandwidth. Therefore, the system has in total four
NUMA nodes, 192 virtual cores, 1 TB of RAM and a total of
230.4 GB/s of sustainable memory bandwidth per socket, as
illustrated in Figure 1. For the software stack, the machine is
configured with Ubuntu 16.10, kernel 4.4.0-47-generic, IBM
Java 1.8.0, and Apache Spark 1.6.1.
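As a quick sanity check, a NUMA layout like the one described above can be inspected with standard Linux tools; the following is a minimal sketch (not part of the original study), assuming numactl and util-linux are installed:

    #!/bin/bash
    # Sketch: inspect the NUMA topology of a Linux host such as the
    # Power8 machine described above. Node counts, CPU numbering and
    # memory sizes will differ on other systems.
    numactl --hardware    # NUMA nodes, their CPUs, memory sizes and distances
    lscpu | grep -i numa  # summary of NUMA nodes and the CPUs attached to each
    numastat              # per-node counters of local vs. remote memory allocations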
III. RELATED WORK
Although there have been several research efforts to in-
vestigate and mitigate the impact of NUMA on workload
performance, this topic is still gaining traction in the literature
in recent years [5], [6], [7], [8], [9], [10] [11], [12], [13]. These
works characterize some key sources of workload performance
degradation related to NUMA (such as additional latencies
and possibly lower memory bandwidth because of remote
memory access and contentions on the memory controller,
bus connections or cache contention), and propose OS task
placement strategies for mitigating remote memory access.
However, the characterization of Spark workloads on IBM Power8
systems and placement strategies for co-scheduled applications
are still not well understood.
For example, [6] has characterized the NUMA performance
impact related to remote memory access introduced by the OS
when performing task re-scheduling or load balancing. While
this work proposed an effective approach to mitigating remote
memory access and cache contention, it is not application-
driven and does not have a holistic view of all applications to
define efficient workload co-scheduling on NUMA systems.
In our work, we demonstrate the potential benefits from man-
ual binding strategies when co-scheduling multiple workloads
on NUMA systems.

Another example is the work of [14], where the authors
characterized the performance impact of NUMA on graph-
analytics applications. They present an approach to minimize
remote memory access by using graph-aware data allocation
and access strategies. While this work presents an application-
driven investigation, it lacks an analysis of memory-intensive
Spark workloads and workload collocation.
The works most closely related to ours are [15] and
[16]. In the former, the authors quantify the impact of data
locality on NUMA nodes for Spark workloads on an Intel Ivy
Bridge server. In the latter, the authors evaluate the impact
of NUMA locality on the performance of in-memory data
analytics with Spark on an Intel Ivy Bridge server. In both
papers, they run benchmarks with two configurations: a) Local
DRAM and b) Remote DRAM. In Local DRAM, they bind the Spark
process to processor 0 and memory node 0, and in Remote
DRAM, they bind the Spark process to processor 0 and
memory node 1. Then, they compare the results to evaluate the
performance impact of NUMA. While those works present a
detailed performance characterization of Spark workloads on
NUMA systems on an Intel Ivy Bridge server, the NUMA per-
formance characterization of IBM Power8 systems is still not
understood. Moreover, their work does not present the NUMA
impact of optimally binding the workloads versus leaving the
OS allocating the resources. Also, they do not evaluate the
performance benefits of performing manual binding for co-
scheduled Spark workloads as we present in this paper.
IV. METHODOLOGY
This section describes how the study on the impact of
NUMA topology on in-memory data analytics workloads has
been performed, as well as the rationale behind the experi-
ments evaluated in the following sections.
A. Workloads
The experiments presented in this paper are based on Spark-
Bench [17], which is a benchmark suite developed by IBM and
widely tested in Power8 systems. From the range of available
workloads provided by the benchmark, Support Vector Ma-
chines (SVM), PageRank, and Spark SQL RDDRelation have
been selected for the evaluation. These workloads are well-
known in the literature, and combine different characteristics
to cover a large range of possible configurations. The dataset
sizes for SVM, SQL and PageRank are 47, 24, and 17 GB, and the
numbers of partitions are 376, 192 and 136, respectively, for
all experiments.
B. Experimental Setup
Since the goal of this paper is to evaluate the perfor-
mance of Spark workloads on NUMA hardware, all the
experiments are conducted in a single machine; the charac-
teristics of the machine’s architecture are described in Sec-
tion II-B. For simplicity, Spark is configured in the standalone
mode [18]. To control the number of cores, the amount of memory, and
the number of executors of each worker, the parameters
SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, and
SPARK_EXECUTOR_MEMORY [18] are used, respectively. All
the other parameters and values used to configure Spark
during the experiments described later in this paper
are summarized in Table I.

parameter                       | value
spark.serializer                | org.apache.spark.serializer.KryoSerializer
spark.rdd.compress              | FALSE
spark.io.compression.codec      | lzf
storage level                   | MEMORY_AND_DISK
spark.driver.maxResultSize      | 2 gb
spark.driver.memory             | 5 gb
spark.kryoserializer.buffer.max | 512m
TABLE I: Spark configuration parameters
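For illustration only, such a setup might be expressed as follows in a Spark 1.6 standalone deployment; the file names follow Spark's usual conventions, and the per-worker values are just one sample point taken from Table II (24 workers with 8 cores and 41 GB each), not a recommendation from the paper:

    #!/bin/bash
    # Hypothetical sketch of a standalone Spark configuration of the kind
    # evaluated here. Exact per-worker values vary as described in Table II.

    # conf/spark-env.sh: per-worker resources
    cat > "$SPARK_HOME/conf/spark-env.sh" <<'EOF'
    export SPARK_WORKER_CORES=8       # virtual cores per worker
    export SPARK_WORKER_MEMORY=41g    # memory per worker
    export SPARK_EXECUTOR_MEMORY=41g  # memory per executor
    EOF

    # conf/spark-defaults.conf: the fixed parameters listed in Table I
    cat > "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
    spark.serializer                 org.apache.spark.serializer.KryoSerializer
    spark.rdd.compress               false
    spark.io.compression.codec       lzf
    spark.driver.maxResultSize       2g
    spark.driver.memory              5g
    spark.kryoserializer.buffer.max  512m
    EOF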
Hardware counters have been used to collect most real-
time information from experimental executions, using the
perfmon2 [19] library. Memory bandwidth is calculated based
on the formula defined in [20]. For CPU usage, memory usage,
and context switches, the vmstat tool has been used. To collect
information about NUMA memory accesses, the numastat tool is
used. Finally, the numactl command has been used to bind
a workload to a set of CPUs or memory regions (e.g. to
a NUMA node); the nB nomenclature is used to describe
different binding configurations, where n is the number of
assigned NUMA nodes, e.g. 1B for workloads bound to 1
NUMA node, 2B for workloads bound to 2 NUMA nodes,
etc.
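A minimal sketch of how such an execution could be instrumented and bound is shown below; it is illustrative rather than the paper's actual scripts, and the master URL, log file names and sampling interval are assumptions:

    #!/bin/bash
    # Illustrative sketch: collect the metrics named above and launch a
    # Spark standalone worker bound to a single NUMA node (the 1B case).
    MASTER_URL="spark://localhost:7077"   # assumed master location

    # System-level metrics, sampled once per second in the background
    vmstat 1 > vmstat.log &                                             # CPU, memory, context switches
    bash -c 'while true; do numastat >> numastat.log; sleep 1; done' &  # local vs. remote accesses

    # 1B: restrict the worker's CPUs and memory allocations to NUMA node 0
    numactl --cpunodebind=0 --membind=0 \
        "$SPARK_HOME/sbin/start-slave.sh" "$MASTER_URL"
    # 2B would use --cpunodebind=0,1 --membind=0,1, and so on up to 4B.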
V. EXPERIMENT 1: WORKLOAD CHARACTERIZATION
This experiment consists of a performance characterization
of Spark workloads, changing the configuration parameters of
Spark itself and observing the impact of different configura-
tions in terms of completion time and resource consumption.
The goal is to find which configurations lead to optimal
performance, e.g. which number of cores per Spark worker,
and/or the number of workers per application. Since this
experiment aims at defining the software configuration, there
is no other kind of hardware tuning involved; the OS performs
its default resource allocation decisions. More specifically, this
experiment analyzes the effect of the software configuration on
resource usage intensity and the possible speedups for the
workloads described in Section IV-A.
As described in Section II-B, the machine used for the
experiments has 4 NUMA nodes, 192 virtual cores, and 1TB
of main memory. Thus, this experiment varies the number
of cores, workers, and memory up to the total amount of
resources (cores and memory) available in this machine. Table
II describes all combinations of the amount of resources
allocated to all workloads in this experiment. The amount of
memory ranges from 5 up to 250 GB per worker, and the number
of workers varies from 4 up to 192. Depending on the
number of workers, the number of virtual cores ranges from 1
up to 192. If the total number of workers is 192, each one will
have only one virtual core. Note that, by building this matrix
of experiments, we want to see for which configuration the
operating system produces optimal results for each workload
type in terms of completion time.

wem (GB) | tma (GB) | tsw | total cores allocated for cores per worker = 1, 2, 3, 4, 6, 8, 12, 16, 24, 48
250      | 1000     | 4   | 4, 8, 12, 16, 24, 32, 48, 64, 96, 192
125      | 1000     | 8   | 8, 16, 24, 32, 48, 64, 96, 128, 192
83       | 996      | 12  | 12, 24, 36, 48, 72, 96, 144, 192
62       | 992      | 16  | 16, 32, 48, 64, 96, 128, 192
41       | 984      | 24  | 24, 48, 72, 96, 144, 192
31       | 992      | 32  | 32, 64, 96, 128, 192
20       | 960      | 48  | 48, 96, 144, 192
15       | 960      | 64  | 64, 128, 192
10       | 960      | 96  | 96, 192
5        | 960      | 192 | 192
TABLE II: Experiment 1: Evaluated software configurations (wem is worker and executor memory; tma is total memory allocated; tsw is total Spark workers; configurations whose total core count would exceed the 192 virtual cores of the machine are not evaluated)
workload | total workers | cores per worker | total cores allocated | worker/executor memory (GB) | total memory allocated (GB) | execution time (sec)
SVM      | 24            | 8                | 192                   | 41                          | 984                         | 323.71
SQL      | 4             | 12               | 48                    | 250                         | 1000                        | 206.82
PageRank | 12            | 8                | 96                    | 83                          | 996                         | 748.08
TABLE III: Experiment 1: Best configuration when optimizing for completion time
We assign memory to the Spark worker and executor by dividing
1000 GB by the total number of workers in the experiment and
taking only the integer part; for example, with 24 workers each
worker gets floor(1000/24) = 41 GB. Thus, the total amount of
memory allocated ranges from 960 GB to 1000 GB. Some amount of
memory is intentionally left for the OS and other processes
(e.g. the Spark driver and master) to avoid slowdown effects
not related to NUMA.
In Spark, a software configuration defines the number of
workers, the number of virtual cores and the amount of memory
per worker that is assigned to a specific workload. These
software resources need to match the hardware configuration
of the node used to run the workloads. But not all applications
can take advantage of an increasing amount of resources and
therefore it is not always the case that one single software
configuration optimizes the performance of a Spark Workload
for a given hardware setup.
Table III summarizes the optimal configurations that were found
for the three workloads considered in this experiment. As it
can be seen, every workload achieves maximum performance
using a different software configuration, with SVM being the
application that can take advantage of the most parallel threads,
followed by PageRank and finally SQL. It is remarkable that
even configurations with a similar number of allocated cores
tend to deliver different performance depending on the number
of workers and the number of cores per worker. For instance,
SQL works best with fewer workers and more cores per worker,
while SVM gets the best performance when more workers and
fewer cores per worker are assigned. This is because SQL is
more affected by thread locks and cache contention than SVM.
Hence, SQL benefits from fewer
threads competing for resources. Additionally, because the
JVM includes additional overheads (e.g. garbage collection) and
more layers for resource management and memory indexing, it
is not beneficial to have several workers with only one virtual
core.
To explain the root cause of the performance delivered
by the different configurations, Tables IV , V and VI also
show the executions times in seconds obtained for SVM ,
SQL, and PageRank respectively when using all combinations
of software configurations, but in this case, we color each
configuration according to the relative performance delivered
compared to the optimal configuration found for that particular
workload. Based on this property we classify configurations
in different groups:
Within 10% of optimal: configurations for which comple-
tion time is very close to the best execution time observed
for that particular workload.
Low CPU Usage: configurations for which CPU usage is
clearly below the observed CPU usage for the workload.
These configurations use too few cores or workers to fully
utilize the available compute resources and produce optimal
results.
High CPI and Context Switches: executions where cycles
per instruction (CPI) and context switches are greater than
the values observed for the optimal configuration. This is
because more executors spawn more threads to process the tasks,
and executors also need to communicate with each other and with
the driver, which increases communication overhead. Remote
memory accesses further raise the CPI, since they require more
CPU cycles than local accesses. As a result, we observe an
increase in context switches and CPI.
High L3 misses: configurations where L3 cache misses
are greater than in the optimal configuration. This group
is only defined for SVM as it is the only workload for
which this behavior was observed.
Low Memory Bandwidth: configurations where memory
bandwidth usage is less than the observed value for the
optimal configuration.
Require more investigation: configurations where the metric
values are within the range of the optimal region but the
completion time is more than 10% away from the optimum. The
experiments in this region require further investigation; the
reason for the performance difference with the optimal
configuration could not be determined so far.
w   | SVM completion time (s) for cores per worker = 1, 2, 3, 4, 6, 8, 12, 16, 24, 48
4   | 1437.95, 1018.4, 816.87, 698.34, 597.13, 515.9, 501.78, 464.26, 472.28, 375.2
8   | 759.16, 555.26, 478.23, 411.91, 366.39, 357.54, 347.49, 359.18, 360.48
12  | 531.9, 422.6, 420.24, 382.46, 386.72, 332.44, 353.68, 339.51
16  | 458.58, 412.85, 385.79, 352.69, 347.22, 333.63, 336.26
24  | 413.1, 394.72, 371.6, 358.46, 354.17, 323.71
32  | 405.16, 389.16, 369.81, 361.36, 341.53
48  | 442.8, 427.85, 398.61, 370.78
64  | 546.11, 522.3, 569.83
96  | 1118.77, 922.01
192 | 1980.92
(Cells in the original are color-coded into groups: Within 10% of optimal, Low CPU usage, High CPI and Context Switches, High L3 misses, Require more investigation.)
TABLE IV: SVM completion time (seconds) groups
The second objective of this experiment was to characterize
the CPU, Memory Footprint and Memory Bandwidth demands
of each one of the workloads of study. For this purpose, we
monitored the execution of the workloads when the optimal
software configuration was in use and plotted the average
resource consumption in Figure 2. As it can be seen, results
show that the memory usage is 457.7 GB, 364.3 GB and 329.4 GB
for SVM, SQL and PageRank, respectively.

w   | SQL completion time (s) for cores per worker = 1, 2, 3, 4, 6, 8, 12, 16, 24, 48
4   | 1148.69, 616.46, 427.64, 338.64, 288.48, 231.35, 206.82, 229.65, 235.82, 238.99
8   | 590.22, 335.51, 264.95, 255.86, 220.93, 209.33, 259.7, 236.06, 220.27
12  | 426.48, 270.56, 244.68, 229.2, 217.48, 223.57, 296.5, 228.74
16  | 375.65, 288.92, 242.9, 247.26, 241.71, 245.1, 240.67
24  | 347.14, 284.32, 264.94, 254.9, 282.7, 246.8
32  | 347.65, 293.96, 268.68, 287.77, 262.32
48  | 328.69, 299.43, 285.76, 282.98
64  | 324.99, 307.44, 307.54
96  | 349.75, 355.28
192 | 591.11
(Cells in the original are color-coded into groups: Within 10% of optimal, Low CPU usage, High CPI and Context Switches, Require more investigation.)
TABLE V: SQL completion time (seconds) groups
w   | PageRank completion time (s) for cores per worker = 1, 2, 3, 4, 6, 8, 12, 16, 24, 48
4   | 4771.33, 2028.77, 1532.34, 1188.68, 1016.5, 1000.84, 2358.19, 2207.09, 1733.52, 3186.17
8   | 2517.32, 1145.33, 997.91, 902.02, 920.09, 861.18, 816.06, 1148.35, 1319.35
12  | 1580.25, 980.71, 911.1, 785.84, 788.52, 748.08, 1028.29, 1010.76
16  | 1379.76, 921.87, 871.61, 862.69, 766.9, 925.91, 877.9
24  | 1175.22, 909.71, 866.08, 843.5, 780.68, 812.44
32  | 1085.66, 875.52, 907, 1095.11, 880.61
48  | 996.54, 858.22, 760.27, 767.63
64  | 1183.04, 1143.43, 912.03
96  | 1447.05, 1142.42
192 | 2918.32
(Cells in the original are color-coded into groups: Within 10% of optimal, Low CPU usage, High CPI and Context Switches, Low Memory Bandwidth, Require more investigation.)
TABLE VI: PageRank completion time (seconds) groups
Fig. 2: Experiment 1: CPU Usage (percentage) and Memory
Bandwidth (GB/s) for optimal configuration
Figure 2 also shows the average usage of user CPU time and
memory bandwidth for these workloads when the optimal software configuration is
in use. As it can be observed, SVM is constrained by the
high CPU usage, reaching around 80% for user CPU time alone,
which, when added to the system and wait CPU times, adds up
to about 100% CPU usage; this is the actual performance
bottleneck.
SQL is a more interesting case because CPU and Memory
Bandwidth usage are really low for the fastest configuration,
and no other resource is apparently acting as a bottleneck. In
practice, what is avoiding the total CPU usage to go higher
is the fact that the number of threads spawned in this
configuration (only 48) is well below the number of hardware
threads offered by the system. Therefore, only a third of
the hardware threads are in use and that is why the average
CPU utilization is shown to be low: several hardware threads
are idle. Intuition would say that increasing the number of
threads would increase the performance, but in practice what
is observed in the logs of other experiments is that as soon
as the number of threads goes higher the memory bandwidth
dramatically increases, quickly becoming the bottleneck at
many stages during the execution. The bottom line is that
the OS is not able to correctly manage the threads for this
workload, creating memory access patterns that saturate the
memory links of the P8 processor.
Finally, PageRank is in the same situation: the optimal
configuration involves 96 software threads only, while the
system offers 192 hardware threads. In practice, what it
means is that the reported CPU utilization is low. Intuition
again would point in the direction of increasing the number
of software threads, but when that direction is taken, logs
show that the additional software threads start competing for
memory bandwidth because they exhibit worse memory access
patterns, and saturate the memory links.
In summary, this experiment has shown two cases in which
not all hardware threads could be exploited because the
memory access patterns across NUMA nodes were hitting a
memory bandwidth bottleneck. This is an interesting result
because it opens the door to smart workload collocation
strategies, which are explored in the following experiments.
VI. EXPERIMENT 2: BINDING TO NUMA NODES
Allocating more NUMA nodes to a workload has the
potential to increase resources (such as memory bandwidth,
CPU, and memory) and possibly lower cache contention (due
to the availability of additional cores and cache), but it can also
involve a trade-off: using remote memory accesses and dealing
with bus contention, which can lead to slowdowns in some
scenarios. Binding, in this context, means Spark processes
(master and workers) will only have access to the resources
(cores and memory) of a particular set of NUMA nodes.
While the previous experiment selected the optimal software
configuration without binding, allowing the operating system
to make all decisions, this experiment selects the optimal
software configuration when binding all 4 nodes (4B). Results
are also compared to the previous experiment so as to evaluate
the impact of binding.
While in the previous experiment the OS was responsible for
allocating all the resources, in this experiment we enforce
decisions by manually binding the workloads across the NUMA
nodes. The main motivation for performing the manual binding
is to mitigate the limitations identified in the previous
experiment. Manually binding the workloads can achieve better
load balancing and minimize thread migrations and remote memory
accesses. In order to verify those assumptions, we evaluate
the completion time of all workloads for different binding
configurations and compare them with the non-binding approach
(the default allocation from the OS). As explained in Section IV,
the workloads are bound to between one and four NUMA nodes
(1B, 2B, 3B, 4B). The results of the optimal software
configuration for the different numbers of NUMA nodes are shown
in Table VII, which also compares manual binding with four NUMA
nodes versus the default OS resource allocation, labeled as NB.
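As an illustration of what such a binding could look like in practice (a sketch under assumptions, not the paper's tooling), the snippet below launches Spark standalone daemons restricted to two NUMA nodes, loosely mirroring the 2B SQL configuration reported below (3 workers per node, 8 cores per worker); the master URL, ports and per-worker memory are placeholders:

    #!/bin/bash
    # Hypothetical 2B binding: master and workers only see the CPUs and
    # memory of NUMA nodes 0 and 1.
    MASTER_URL="spark://localhost:7077"
    NODES=(0 1)            # the two NUMA nodes of the 2B configuration
    WORKERS_PER_NODE=3
    WEBUI_PORT=8081

    # Master bound to the first node of the set
    numactl --cpunodebind=0 --membind=0 "$SPARK_HOME/sbin/start-master.sh"

    # Workers started by hand so that several instances can share the host;
    # each instance is restricted to the CPUs and memory of its NUMA node.
    for node in "${NODES[@]}"; do
      for i in $(seq 1 "$WORKERS_PER_NODE"); do
        numactl --cpunodebind="$node" --membind="$node" \
          "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker \
          --cores 8 --memory 41g --webui-port "$WEBUI_PORT" "$MASTER_URL" &
        WEBUI_PORT=$((WEBUI_PORT + 1))
      done
    done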
The optimal configurations are 24 cores per worker and
1 worker per node for SQL, SVM, and PageRank when we
bind workloads to one NUMA node (1B). In the case of 2B, the
optimal configurations are 8 cores per worker and 3 workers
per node for SQL, 6 cores per worker and 6 workers per node
for SVM and 4 cores per worker and 6 workers per node for
PageRank. Similarly, in case of 3B, the optimal configurations
are 8 cores per worker and 2 workers per node for SQL, 6 cores

Citations
Journal ArticleDOI
TL;DR: This paper proposes and evaluates a system to efficiently reduce bandwidth and storage for telemetry data through real-time modeling using Markov chain based methods and compares with polynomial regression methods for reducing and reconstructing data.
Abstract: Large-scale data centers are composed of thousands of servers organized in interconnected racks to offer services to users. These data centers continuously generate large amounts of telemetry data streams (e.g., hardware utilization metrics) used for multiple purposes, including resource management, workload characterization, resource utilization prediction, capacity planning, and real-time analytics. These telemetry streams require costly bandwidth utilization and storage space, particularly at medium-long term for large data centers. This paper addresses this problem by proposing and evaluating a system to efficiently reduce bandwidth and storage for telemetry data through real-time modeling using Markov chain based methods. Our proposed solution was evaluated using real telemetry datasets and compared with polynomial regression methods for reducing and reconstructing data. Experimental results show that data can be lossy compressed up to $75\%$ for bandwidth utilization and $95.33\%$ for storage space, with reconstruction accuracy close to $92\%$ .

8 citations


Cites methods from "Performance Characterization of Spa..."

  • ...We used IBM POWER8 telemetry logs dataset [45] to evaluate our proposed method for telemetry reduction and recon-...


Journal ArticleDOI
TL;DR: In this article, a data placement strategy based on RDD dynamic weight is introduced for the Spark frame of geo-distributed cloud systems, aiming at the data placement problem, and the algorithm can effectively adjust the weight of node data placement according to the actual task execution information, and shorten the task execution time.
Abstract: With the rapid development and the widespread use of cloud computing in various applications, the number of users distributed in different regions has grown exponentially. Therefore, the Geo-distributed cloud systems have become a research hotspot and big data processing technology has also emerged. Nowadays, the most widely used big data processing framework is Spark. However, massive amounts of data are generated every moment, and the processing procedure becomes more and more complex, the execution efficiency of Spark has been greatly affected. In the Spark frame of geo-distributed cloud systems, aiming at the data placement problem, the data placement strategy based on RDD dynamic weight is introduced. The target node is selected with a strong computation capacity to place the data. Aiming at the problems of multi-task scheduling, a task will be scheduled to a node whose computation capacity can satisfy the requirement of this task. And then considering job classification and computing node performance, the optimized task scheduling strategy is in traduced. Experiments show that our algorithms can effectively adjust the weight of node data placement according to the actual task execution information, and shorten the task execution time.

7 citations

Dissertation
29 Apr 2019
TL;DR: This thesis aims to find the right techniques and design decisions to build a scalable distributed system for the IoT under the Fog Computing paradigm to ingest and process data and explore the trade-offs and challenges in the design of a solution from Edge to Cloud to address opportunities that current and future technologies will bring in an integrated way.
Abstract: In recent years there has been an extraordinary growth of the Internet of Things (IoT) and its protocols. The increasing diffusion of electronic devices with identification, computing and communication capabilities is laying ground for the emergence of a highly distributed service and networking environment. The above mentioned situation implies that there is an increasing demand for advanced IoT data management and processing platforms. Such platforms require support for multiple protocols at the edge for extended connectivity with the objects, but also need to exhibit uniform internal data organization and advanced data processing capabilities to fulfill the demands of the application and services that consume IoT data. One of the initial approaches to address this demand is the integration between IoT and the Cloud computing paradigm. There are many benefits of integrating IoT with Cloud computing. The IoT generates massive amounts of data, and Cloud computing provides a pathway for that data to travel to its destination. But today’s Cloud computing models do not quite fit for the volume, variety, and velocity of data that the IoT generates. Among the new technologies emerging around the Internet of Things to provide a new whole scenario, the Fog Computing paradigm has become the most relevant. Fog computing was introduced a few years ago in response to challenges posed by many IoT applications, including requirements such as very low latency, real-time operation, large geo-distribution, and mobility. Also this low latency, geo-distributed and mobility environments are covered by the network architecture MEC (Mobile Edge Computing) that provides an IT service environment and Cloud-computing capabilities at the edge of the mobile network, within the Radio Access Network (RAN) and in close proximity to mobile subscribers. Fog computing addresses use cases with requirements far beyond Cloud-only solution capabilities. The interplay between Cloud and Fog computing is crucial for the evolution of the so-called IoT, but the reach and specification of such interplay is an open problem. This thesis aims to find the right techniques and design decisions to build a scalable distributed system for the IoT under the Fog Computing paradigm to ingest and process data. The final goal is to explore the trade-offs and challenges in the design of a solution from Edge to Cloud to address opportunities that current and future technologies will bring in an integrated way. This thesis describes an architectural approach that addresses some of the technical challenges behind the convergence between IoT, Cloud and Fog with special focus on bridging the gap between Cloud and Fog. To that end, new models and techniques are introduced in order to explore solutions for IoT environments. This thesis contributes to the architectural proposals for IoT ingestion and data processing by 1) proposing the characterization of a platform for hosting IoT workloads in the Cloud providing multi-tenant data stream processing capabilities, the interfaces over an advanced data-centric technology, including the building of a state-of-the-art infrastructure to evaluate the performance and to validate the proposed solution. 2) studying an architectural approach following the Fog paradigm that addresses some of the technical challenges found in the first contribution. The idea is to study an extension of the model that addresses some of the central challenges behind the converge of Fog and IoT. 
3) Design a distributed and scalable platform to perform IoT operations in a moving data environment. The idea after study data processing in Cloud, and after study the convenience of the Fog paradigm to solve the IoT close to the Edge challenges, is to define the protocols, the interfaces and the data management to solve the ingestion and processing of data in a distributed and orchestrated manner for the Fog Computing paradigm for IoT in a moving data environment.

2 citations


Cites background from "Performance Characterization of Spa..."

  • ...[122] Shuja-ur-Rehman Baig, Marcelo Amaral, Jordà Polo and David Carrera, "Performance Characterization of Spark Workloads on Shared NUMA Systems," 2018 IEEE Fourth International Conference on Big Data Computing Service and Applications (BigDataService), Bamberg, 2018, pp....


  • ...[122] Shuja ur Rehman Baig, Marcelo Amaral, Jordà Polo, and David Carrera....


Proceedings ArticleDOI
01 May 2022
TL;DR: In this paper , the authors introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations.
Abstract: Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds-up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.
References
Journal ArticleDOI
TL;DR: NUMA (non-uniform memory access) is the phenomenon that memory at various points in the address space of a processor have different performance characteristics, and at current processor speeds, the signal path length from the processor to memory plays a significant role.
Abstract: NUMA (non-uniform memory access) is the phenomenon that memory at various points in the address space of a processor have different performance characteristics. At current processor speeds, the signal path length from the processor to memory plays a significant role. Increased signal path length not only increases latency to memory but also quickly becomes a throughput bottleneck if the signal path is shared by multiple processors. The performance differences to memory were noticeable first on large-scale systems where data paths were spanning motherboards or chassis. These systems required modified operating-system kernels with NUMA support that explicitly understood the topological properties of the system’s memory (such as the chassis in which a region of memory was located) in order to avoid excessively long signal path lengths. (Altix and UV, SGI’s large address space systems, are examples. The designers of these products had to modify the Linux kernel to support NUMA; in these machines, processors in multiple chassis are linked via a proprietary interconnect called NUMALINK.)

161 citations


"Performance Characterization of Spa..." refers background in this paper

  • ...vestigate and mitigate the impact of NUMA on workload performance, this topic is still gaining traction in the literature in recent years [5], [6], [7], [8], [9], [10] [11], [12], [13]....


Journal ArticleDOI
TL;DR: This study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling, and finds a classification scheme that addresses not only contention for cache space, but contention for other shared resources, such as the memory controller, memory bus and prefetching hardware.
Abstract: Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how and to what extent contention for shared resource can be mitigated via thread scheduling. Scheduling is an attractive tool, because it does not require extra hardware and is relatively easy to integrate into the system. Our study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling. The most difficult part of the problem is to find a classification scheme for threads, which would determine how they affect each other when competing for shared resources. We provide a comprehensive analysis of such classification schemes using a newly proposed methodology that enables to evaluate these schemes separately from the scheduling algorithm itself and to compare them to the optimal. As a result of this analysis we discovered a classification scheme that addresses not only contention for cache space, but contention for other shared resources, such as the memory controller, memory bus and prefetching hardware. To show the applicability of our analysis we design a new scheduling algorithm, which we prototype at user level, and demonstrate that it performs within 2p of the optimal. We also conclude that the highest impact of contention-aware scheduling techniques is not in improving performance of a workload as a whole but in improving quality of service or performance isolation for individual applications and in optimizing system energy consumption.

158 citations


"Performance Characterization of Spa..." refers background in this paper

  • ...vestigate and mitigate the impact of NUMA on workload performance, this topic is still gaining traction in the literature in recent years [5], [6], [7], [8], [9], [10] [11], [12], [13]....


  • ...For example, [6] has characterized the NUMA performance impact related to remote memory access introduced by the OS...


Journal ArticleDOI
TL;DR: The core microarchitecture innovations made in the POWER8 processor, designed to significantly improve both single-thread performance and single-core throughput over its predecessor, the POWER7® processor, are described.
Abstract: The POWER8™ processor is the latest RISC (Reduced Instruction Set Computer) microprocessor from IBM. It is fabricated using the company's 22-nm Silicon on Insulator (SOI) technology with 15 layers of metal, and it has been designed to significantly improve both single-thread performance and single-core throughput over its predecessor, the POWER7® processor. The rate of increase in processor frequency enabled by new silicon technology advancements has decreased dramatically in recent generations, as compared to the historic trend. This has caused many processor designs in the industry to show very little improvement in either single-thread or single-core performance, and, instead, larger numbers of cores are primarily pursued in each generation. Going against this industry trend, the POWER8 processor relies on a much improved core and nest microarchitecture to achieve approximately one-and-a-half times the single-thread performance and twice the single-core throughput of the POWER7 processor in several commercial applications. Combined with a 50% increase in the number of cores (from 8 in the POWER7 processor to 12 in the POWER8 processor), the result is a processor that leads the industry in performance for enterprise workloads. This paper describes the core microarchitecture innovations made in the POWER8 processor that resulted in these significant performance benefits.

154 citations


"Performance Characterization of Spa..." refers background in this paper

  • ...Simultaneous Multi-Threading (SMT) [4] Each Power8 processor is composed of two memory regions (i....


  • ...B. IBM Power 8 We run all the experiments in the IBM Power System 8247- 42L, which is 2-way 12-core IBM Power8 machine with all cores at 3.02GHz and with each core able to run up to 8 Simultaneous Multi-Threading (SMT) [4] Each Power8 processor is composed of two memory regions (i.e NUMA node) with 6 cores and their own memory controller and 256GB of RAM....


Proceedings ArticleDOI
Zoltan Majo1, Thomas R. Gross1
04 Jun 2011
TL;DR: This work presents a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem, and describes two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention.
Abstract: Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance up to 32%, and 7% on average, over the default setup in current Linux implementations.

110 citations


"Performance Characterization of Spa..." refers background in this paper

  • ...vestigate and mitigate the impact of NUMA on workload performance, this topic is still gaining traction in the literature in recent years [5], [6], [7], [8], [9], [10] [11], [12], [13]....


Proceedings Article
12 Jun 2012
TL;DR: This study shows that live VM migration can be used to mitigate the contentions on micro-architectural resources and proposes and evaluates two cluster-level virtual machine scheduling techniques for cache sharing and NUMA affinity, which do not require any prior knowledge on the behaviors of VMs.
Abstract: Although virtual machine (VM) migration has been used to avoid conflicts on traditional system resources like CPU and memory, micro-architectural resources such as shared caches, memory controllers, and non-uniform memory access (NUMA) affinity, have only relied on intra-system scheduling to reduce contentions on them. This study shows that live VM migration can be used to mitigate the contentions on micro-architectural resources. Such cloud-level VM scheduling can widen the scope of VM selections for architectural shared resources beyond a single system, and thus improve the opportunity to further reduce possible conflicts. This paper proposes and evaluates two cluster-level virtual machine scheduling techniques for cache sharing and NUMA affinity, which do not require any prior knowledge on the behaviors of VMs.

71 citations


"Performance Characterization of Spa..." refers background in this paper

  • ...vestigate and mitigate the impact of NUMA on workload performance, this topic is still gaining traction in the literature in recent years [5], [6], [7], [8], [9], [10] [11], [12], [13]....


Frequently Asked Questions (18)
Q1. What have the authors contributed in "Performance characterization of spark workloads on shared numa systems" ?

This opens an enormous range of opportunities for runtimes and applications that aim to improve their performance by leveraging low latencies and high bandwidth provided by RAM. This paper explores how Spark-based workloads are impacted by the effects of NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes. The authors explore several workloads run on top of the IBM Power8 processor, and provide manual strategies that can leverage performance improvements up to 40 % on Spark workloads when using smart processorpinning and workload collocation strategies. 

Because manual binding minimizes remote memory accesses, and local memory accesses have higher bandwidth and lower latency, some workloads benefit from it.

When processes are not bound to specific NUMA nodes, the OS is in charge of all placement decisions, which includes reactive migrations to balance the load. 

Because the main memory is connected to the processor using separate links for read and write operations, with two links for memory reads and one link for memory writes, the system has asymmetric read and write bandwidth.

Hardware counters have been used to collect most real-time information from experimental executions, using the perfmon2 [19] library.

the numactl command has been used to bind a workload to a set of CPUs or memory regions (e.g. to a NUMA node); the nB nomenclature is used to describe different binding configurations, where n is the number of assigned NUMA nodes, e.g. 1B for workloads bound to 1 NUMA node, 2B for workloads bound to 2 NUMA nodes, etc.

In particular, for any pair of workloads that is evaluated, the authors run both of them in a continuous loop of 90 minutes, from which the first 15 minutes are taken as a warm-up period and the final 15 minutes as a cool-down process. 

The obtained results show that binding Spark processes to particular NUMA nodes can speed up the completion time of co-located workloads by up to 1.39x, due to less interconnect traffic, fewer remote memory accesses, fewer context switches, and lower CPI.

As it can be observed, SVM is constrained by high CPU usage, reaching around 80% for user CPU time alone, which, when added to the system and wait CPU times, adds up to about 100% CPU usage; this is the actual performance bottleneck.

The optimal configurations in case of 4B are 12 cores per worker and 1 worker per node for SQL, 8 cores per worker and 3 workers per node for SVM and 6 cores per worker and 3 workers per node for PageRank. 

SQL is a more interesting case because CPU and Memory Bandwidth usage are really low for the fastest configuration, and no other resource is apparently acting as a bottleneck. 

Some amount of memory is intentionally left for the OS and other processes (e.g spark driver, master) to avoid the slowdown effects not related to NUMA. 

This final experiment explores the benefits of workload colocation and process binding (cores and memory) as a mechanism to improve system throughput and increase resource utilization. 

only a third of the hardware threads are in use and that is why the average CPU utilization is shown to be low: several hardware threads are idle. 

Workload co-location is well known to potentially slow down interference-sensitive applications; however, the impact of NUMA on co-scheduled Spark workloads is still not completely understood.

Considering the formula (current nodes + new nodes) / current nodes, allocating one new NUMA node to an application with one current node would lead to a theoretical speedup of (1 + 1) / 1 = 2x.

These configurations use too few cores or workers to fully utilize the available compute resources and produce optimal results.

In practice, what prevents the total CPU usage from going higher is the fact that the number of threads spawned in this configuration (only 48) is well below the number of hardware threads offered by the system.