Performance Characterization of Spark Workloads on Shared NUMA Systems
Summary (3 min read)
Introduction
- To achieve good performance for in-memory computing frameworks on a NUMA system, there is a need to understand the topology of the interconnect between processor sockets and memory banks.
- This paper explores how Spark-based in-memory computing workloads are impacted by the effects of NUMA architecture, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to leverage memory-consumption patterns to smartly co-locate workloads in scale-up nodes.
- Section IV introduces the evaluation methodology used for the experiments.
III. RELATED WORK
- These works characterize some key sources of NUMA-related workload performance degradation (such as additional latency and lower memory bandwidth caused by remote memory accesses, and contention on the memory controller, bus interconnect, or caches), and propose OS task-placement strategies for mitigating remote memory access.
- Another example is the work of [14], where the authors characterized the performance impact of NUMA on graph-analytics applications.
- While those works present a detailed performance characterization of Spark workloads on NUMA systems on an Intel Ivy Bridge server, the NUMA performance of IBM Power8 systems remains poorly understood.
A. Workloads
- The experiments presented in this paper are based on SparkBench [17], which is a benchmark suite developed by IBM and widely tested in Power8 systems.
- From the range of available workloads provided by the benchmark, Support Vector Machines (SVM), PageRank, and Spark SQL RDDRelation have been selected for the evaluation.
- These workloads are well-known in the literature, and combine different characteristics to cover a large range of possible configurations.
B. Experimental Setup
- Since the goal of this paper is to evaluate the performance of Spark workloads on NUMA hardware, all the experiments are conducted on a single machine, an IBM Power System 8247-42L (a 2-way 12-core Power8 server); the characteristics of the machine’s architecture are described in Section II-B.
- All the other parameters and values used to configure Spark during the experiments described later in this paper are summarized in Table I. Hardware counters, accessed through the perfmon2 [19] library, have been used to collect most real-time information from the experimental executions.
- Memory bandwidth is calculated based on the formula defined in [20].
- For CPU usage, memory usage, and context switches, the vmstat tool has been used.
- To collect information about NUMA memory accesses, the numastat tool is used.
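The per-node counters gathered with numastat can be turned into structured data with a small parser. The sketch below is illustrative, not the authors' actual tooling; the `parse_numastat` name and the sample counter values are made up, while the metric names (`numa_hit`, `numa_miss`, `numa_foreign`) are standard numastat output fields.

```python
# Hypothetical sketch: parse numastat-style per-node counters into nested dicts.
def parse_numastat(text):
    rows = [line.split() for line in text.strip().splitlines()]
    node_names = rows[0]  # header row: node0 node1 ...
    stats = {}
    for row in rows[1:]:
        metric, values = row[0], row[1:]
        stats[metric] = dict(zip(node_names, map(int, values)))
    return stats

# Illustrative sample in numastat's usual layout (values are made up).
sample = """\
node0 node1
numa_hit 1000 900
numa_miss 20 35
numa_foreign 35 20
"""

stats = parse_numastat(sample)
print(stats["numa_miss"]["node1"])  # 35
```

A structure like this makes it straightforward to compute, per node, the share of remote accesses (`numa_miss` relative to total) across an experiment.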
V. EXPERIMENT 1: WORKLOAD CHARACTERIZATION
- This experiment consists of a performance characterization of Spark workloads, changing the configuration parameters of Spark itself and observing the impact of different configurations in terms of completion time and resource consumption.
- More specifically, this experiment analyzes the effect of the software configuration on resource-usage intensity and on the possible speedups for the workloads described in Section IV-A.
- This is because SQL is more affected by thread locks and cache contention than SVM.
- Based on this property, the authors classify configurations into different groups. Within 10% of optimal: configurations whose completion time is very close to the best execution time observed for that particular workload.
- This is because more executors spawn more threads to process the tasks.
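The "within 10% of optimal" grouping described above can be sketched as a simple classifier over completion times. The configuration names and timings below are hypothetical; only the 10% threshold comes from the text.

```python
# Illustrative sketch of the within-10%-of-optimal grouping.
def classify(times):
    """Label each configuration relative to the best completion time."""
    best = min(times.values())
    return {
        cfg: ("within 10% of optimal" if t <= 1.10 * best else "suboptimal")
        for cfg, t in times.items()
    }

# Made-up completion times (seconds) for three hypothetical configurations.
times = {"12c-1w": 100.0, "8c-3w": 105.0, "4c-6w": 150.0}
groups = classify(times)
# "4c-6w" falls outside the 10% band of the best time (100.0 s).
```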
VI. EXPERIMENT 2: BINDING TO NUMA NODES
- Allocating more NUMA nodes to a workload has the potential to increase resources (such as memory bandwidth, CPU, and memory) and possibly lower cache contention (due to the availability of additional cores and cache), but it can also involve a trade-off: using remote memory accesses and dealing with bus contention, which can lead to slowdowns in some scenarios.
- In the case of 2B, the optimal configurations are 8 cores per worker and 3 workers per node for SQL, 6 cores per worker and 6 workers per node for SVM, and 4 cores per worker and 6 workers per node for PageRank.
- The results of this experiment, as summarized in Table VII, show a significant speedup when comparing manual binding versus the OS allocating the resources, but not for all workloads.
- The results of this experiment also show how applications scale with more NUMA nodes.
VII. EXPERIMENT 3: WORKLOAD CO-SCHEDULING
- This final experiment explores the benefits of workload colocation and process binding (cores and memory) as a mechanism to improve system throughput and increase resource utilization.
- This experiment, therefore, evaluates the performance impact on workloads when sharing the same machine, that is, when workloads are co-located.
- The authors repeat the process with the 1B-3B configurations (1B for 1 NUMA node with binding, and 3B for 3 NUMA nodes with binding), in which one workload gets assigned one NUMA node while the other gets allocated the other three nodes.
- In all cases the authors executed all combinations of SQL-SVM, SQL-PageRank, SVM-PageRank.
- Remote memory accesses increase by 80–91.93% when the same experiments are executed without binding.
VIII. CONCLUSIONS
- In-memory computing is becoming one of the most popular approaches for real-time big data processing as data sets grow and more memory capacity is made available to popular runtimes such as Spark.
- To deliver large physical memory capacity, modern processors feature Non-Uniform Memory Architectures (NUMA).
- Each socket can contain multiple NUMA regions, each with its own local memory.
- Large sets of experiments were executed to evaluate several Spark workloads, and the results demonstrated that workload colocation is a smart strategy to improve resource utilization for memory-intensive workloads placed in modern NUMA processors.
- Highly concurrent configurations produce undesired memory access patterns across NUMA nodes that push to the limit the existing memory bandwidth, making co-scheduling a good choice.
Frequently Asked Questions (18)
Q1. What have the authors contributed in "Performance characterization of spark workloads on shared numa systems" ?
This opens an enormous range of opportunities for runtimes and applications that aim to improve their performance by leveraging low latencies and high bandwidth provided by RAM. This paper explores how Spark-based workloads are impacted by the effects of NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes. The authors explore several workloads run on top of the IBM Power8 processor, and provide manual strategies that can deliver performance improvements of up to 40% on Spark workloads when using smart processor-pinning and workload-collocation strategies.
Q2. What is the main benefit of manual binding?
Manual binding minimizes remote memory accesses, and since local memory access offers higher bandwidth and lower latency, some workloads benefit from it.
Q3. What is the role of the OS when the process is not bound to specific nodes?
When processes are not bound to specific NUMA nodes, the OS is in charge of all placement decisions, which includes reactive migrations to balance the load.
Q4. Why does the system have asymmetric read and write bandwidth?
Because the main memory is connected to the processor using separate links for read and write operations, with two links for memory reads and one link for memory writes, the system has asymmetric read and write bandwidth.
Q5. What library is used to collect realtime information from experimental executions?
Hardware counters have been used to collect most real-time information from experimental executions, using the perfmon2 [19] library.
Q6. What is the nB nomenclature used to describe different binding configurations?
The numactl command has been used to bind a workload to a set of CPUs or memory regions (e.g. to a NUMA node); the nB nomenclature is used to describe different binding configurations, where n is the number of assigned NUMA nodes, e.g. 1B for workloads bound to 1 NUMA node, 2B for workloads bound to 2 NUMA nodes, etc.
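An nB binding of this kind can be sketched as a helper that composes the numactl invocation. The `--cpunodebind` and `--membind` flags are standard numactl options; the helper name and the wrapped command line are illustrative, not from the paper.

```python
# Sketch: compose a numactl command that binds both CPUs and memory
# to the first n NUMA nodes (the paper's "nB" configurations).
def numactl_bind(n_nodes, command):
    nodes = ",".join(str(i) for i in range(n_nodes))
    return f"numactl --cpunodebind={nodes} --membind={nodes} {command}"

# Hypothetical usage: launch a Spark worker bound to 2 NUMA nodes (2B).
cmd = numactl_bind(2, "spark-class org.apache.spark.deploy.worker.Worker spark://master:7077")
print(cmd)
# numactl --cpunodebind=0,1 --membind=0,1 spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
```

Binding memory to the same nodes as the CPUs is what keeps accesses local; binding only the CPUs would still allow the OS to place pages on remote nodes.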
Q7. How long does it take to run two workloads at the same time?
In particular, for any pair of workloads that is evaluated, the authors run both of them in a continuous loop of 90 minutes, from which the first 15 minutes are taken as a warm-up period and the final 15 minutes as a cool-down process.
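Trimming the warm-up and cool-down windows from such a run can be sketched as a simple filter over timestamped samples. The sample data below is made up; only the 90/15/15-minute windows come from the text.

```python
# Sketch: keep only samples inside the measurement window of a run,
# dropping the warm-up and cool-down periods described above.
def measurement_window(samples, run_min=90, warmup_min=15, cooldown_min=15):
    """samples: list of (minute, value) pairs; returns the steady-state slice."""
    lo, hi = warmup_min, run_min - cooldown_min
    return [(t, v) for t, v in samples if lo <= t < hi]

# Illustrative samples: minute 5 is warm-up, minute 80 is cool-down.
samples = [(5, 0.2), (20, 0.9), (70, 0.8), (80, 0.3)]
kept = measurement_window(samples)  # [(20, 0.9), (70, 0.8)]
```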
Q8. How does the Spark process speed up the completion time of co-located workloads?
The obtained results show that binding Spark processes to particular NUMA nodes can speed up the completion time of co-located workloads by up to 1.39x, due to less interconnect traffic, less remote memory access, fewer context switches, and a lower CPI.
Q9. What is the reason for the performance bottleneck in SVM?
As can be observed, SVM is constrained by high CPU usage, reaching around 80% for user CPU time alone; when added to the system and wait CPU times, this tops out at about 100% CPU usage, which is the actual performance bottleneck.
Q10. What is the optimal configuration for a job?
The optimal configurations in case of 4B are 12 cores per worker and 1 worker per node for SQL, 8 cores per worker and 3 workers per node for SVM and 6 cores per worker and 3 workers per node for PageRank.
Q11. Why is SQL a more interesting case?
SQL is a more interesting case because CPU and Memory Bandwidth usage are really low for the fastest configuration, and no other resource is apparently acting as a bottleneck.
Q12. What is the purpose of the memory left for the OS and other processes?
Some amount of memory is intentionally left for the OS and other processes (e.g. the Spark driver and master) to avoid slowdown effects not related to NUMA.
Q13. What is the purpose of this experiment?
This final experiment explores the benefits of workload colocation and process binding (cores and memory) as a mechanism to improve system throughput and increase resource utilization.
Q14. What is the reason why the average CPU utilization is low?
Only a third of the hardware threads are in use, which is why the average CPU utilization is shown to be low: several hardware threads are idle.
Q15. What is the impact of colocation on the performance of a workload?
Workload co-location is well known to potentially slow down interference-sensitive applications; however, the impact of NUMA on co-scheduled Spark workloads is still not completely understood.
Q16. What is the speedup of a new NUMA node?
Considering the formula (current nodes + new nodes) / current nodes, the theoretical speedup of allocating one new NUMA node to an application with one current node would be 2x.
Q17. What is the reason why the configurations are not optimal?
These configurations use too few cores or workers to fully utilize the available compute resources and produce optimal results.
Q18. What is the reason why the CPU usage is so low?
In practice, what prevents total CPU usage from going higher is that the number of threads spawned in this configuration (only 48) is well below the number of hardware threads offered by the system.