Thomas L. Sterling, John Salmon, Donald J. Becker, Daniel F. Savarese. MIT Press, Cambridge, MA, 1999, 250 pp. ISBN: 026269218X, $31.95 The authors of this book have attempted to describe in 232 pages a subject which is difficult to fit into a book and which is rapidly evolving. Their stated purpose is "enabling, facilitating, and accelerating the adoption of the Beowulf model of distributed computing." They provide an in-depth view of one possible Beowulf system, covering details of hardware selection, operating system configuration, communication software and a parallel sorting application. Overall the book is informative enough to convince people of the value of using Beowulf clustering and basically fulfills the stated purpose, but the subject matter is too broad for a single book. The book begins with an overview of Beowulf systems, including some background material on parallel computers and an outline of the rest of the book. This discussion of parallel computers in general is quite brief. The authors point out the importance of the recent increases in performance of mass-market computers, which is the reason why Beowulf clusters exist. They continue this section by giving an overview of the hardware and software components of the Beowulf cluster. Following the introduction is a discussion of the hardware elements used to build a typical Beowulf cluster. This discussion is nicely written and introduces topics such as the PCI bus, types of memory, and motherboards. The details are useful now, although hardware design evolves too rapidly for this discussion to be as useful several years from now. Next, the authors introduce the Linux operating system. This discussion is not detailed enough to guide a user through installation and configuration, but it does provide an overview and it includes a useful list of references, both printed and electronic. Following the Linux discussion, the authors sketch networking-related issues.
They cover most available network hardware solutions and discuss Ethernet in sufficient detail to understand its performance. Their discussion of TCP/IP is detailed enough to understand IP addressing but does not include a discussion of routing. Overall, this discussion is a mixture of overviews of topics and detailed discussions. The authors provide a useful discussion of how to manage a Beowulf cluster. They suggest practical methods of cloning nodes using Linux tools and they describe methods for day-to-day administration. This discussion is possibly the most useful contribution of the book. Most of the book consists of topics covered in more detail in on-line resources, but managing a Beowulf cluster is a little off the beaten path. Next, the authors discuss parallelism. This includes a categorization of parallel algorithms along with an introduction to a variety of parallel performance metrics. They also give a useful introduction to MPI, including a nice exposition of a sorting example. Their example is a good illustration of MPI programming and comes with a useful analysis of the performance characteristics of the application. Overall, the authors present a useful discussion of Beowulf clusters. Most topics are discussed lightly while a few are discussed in depth. The reader should not expect this book to answer every question encountered in configuring and using a Beowulf cluster. Instead, this book offers an overview of the process, and details can be obtained by consulting manual pages, Linux Howto documents and on-line resources about Linux and MPI. Benjamin R. Seyfarth, University of Southern Mississippi

How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters

We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it can make single-node performance up to 4x slower than that of a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. For example, on the software side we develop a file pooling layer able to improve per-node performance up to 2.8x. On the hardware side we evaluate a system with a large NVRAM buffer between compute nodes and the backend Lustre file system: this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our pooling we can scale up to O(10^4). As our analysis indicates, it is feasible to observe much higher scalability in the near future.
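The abstract does not describe how the file pooling layer works. As a rough illustration only, one way a pooling layer can reduce metadata traffic is to keep a bounded set of open file handles, so that repeated accesses to the same file skip the open call and the metadata round trip it implies on a file system such as Lustre. The class name `FileHandlePool`, its `capacity` parameter, and the least-recently-used eviction policy are assumptions made for this sketch, not the authors' design.

```python
from collections import OrderedDict


class FileHandlePool:
    """Bounded LRU pool of open read-only file handles (hypothetical sketch)."""

    def __init__(self, capacity=128):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._handles = OrderedDict()  # path -> open file object, oldest first
        self.opens = 0                 # number of real open() calls issued

    def get(self, path):
        """Return an open handle for path, reusing a pooled one when possible."""
        handle = self._handles.get(path)
        if handle is not None:
            self._handles.move_to_end(path)  # mark as most recently used
            return handle
        handle = open(path, "rb")            # the only place a file is opened
        self.opens += 1
        self._handles[path] = handle
        if len(self._handles) > self.capacity:
            # Evict and close the least recently used handle.
            _, evicted = self._handles.popitem(last=False)
            evicted.close()
        return handle

    def read_at(self, path, offset, length):
        """Read up to length bytes starting at offset through the pooled handle."""
        handle = self.get(path)
        handle.seek(offset)
        return handle.read(length)

    def close(self):
        """Close every pooled handle and empty the pool."""
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()
```

In this sketch, two `read_at` calls on the same path issue a single `open`, which is the effect a pooling layer would aim for when open calls are expensive. The `opens` counter exists only to make that effect observable.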


Scaling Spark on HPC Systems

With the fast development of big data systems in recent years, a variety of open-source benchmarks have been built to evaluate and compare the workloads on these systems, and to promote their technological improvement. However, to date no comprehensive survey has been written on this topic. This paper attempts to fill the void by presenting a review of the state-of-the-art big data benchmarking efforts. The paper first gives an overview of popular open-source benchmarks from the point of view of big data systems. It then reviews three important aspects of benchmarking: workload generation techniques, workload input data generation techniques, and metrics used to assess systems. For each aspect, the paper divides the surveyed benchmarks into different categories and describes some representative benchmarks, rather than all benchmarks listed, in each category, followed by a discussion of potential research directions to motivate future work in this area.

Benchmarking Big Data Systems: A Review

Non-Volatile Memory (NVM) offers byte-addressability with DRAM-like performance along with persistence. Thus, NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications. HDFS (Hadoop Distributed File System) is the primary storage engine for MapReduce, Spark, and HBase. Even though HDFS was initially designed for commodity hardware, it is increasingly being used on HPC (High Performance Computing) clusters. The stringent performance requirements of HPC systems make the I/O bottlenecks of HDFS a critical issue and motivate rethinking its storage architecture over NVMs. In this paper, we present a novel design for HDFS to leverage the byte-addressability of NVM for RDMA (Remote Direct Memory Access)-based communication. We analyze the performance potential of using NVM for HDFS and re-design HDFS I/O with memory semantics to exploit the byte-addressability fully. We call this design NVFS (NVM- and RDMA-aware HDFS). We also present cost-effective acceleration techniques for HBase and Spark to utilize the NVM-based design of HDFS by storing only the HBase Write Ahead Logs and Spark job outputs to NVM, respectively. We also propose enhancements to use the NVFS design as a burst buffer for running Spark jobs on top of parallel file systems like Lustre. Performance evaluations show that our design can improve the write and read throughputs of HDFS by up to 4x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 45%. The proposed design also reduces the overall execution time of the SWIM workload by up to 18% over HDFS with a maximum benefit of 37% for job-38. For Spark TeraSort, our proposed scheme yields a performance gain of up to 11%. The performance of HBase insert, update, and read operations is improved by 21%, 16%, and 26%, respectively. Our NVM-based burst buffer can improve the I/O performance of Spark PageRank by up to 24% over Lustre.
To the best of our knowledge, this paper is the first attempt to incorporate NVM with RDMA for HDFS.

https://dl.acm.org/doi/pdf/10.1145/2925426.2926290

High Performance Design for HDFS with Byte-Addressability of NVM and RDMA

In-Memory cluster Computing (IMC) frameworks (e.g., Spark) have become increasingly important because they typically achieve more than 10× speedups over the traditional On-Disk cluster Computing (ODC) frameworks for iterative and interactive applications. Like ODC, IMC frameworks typically run the same given programs repeatedly on a given cluster with similar input dataset size each time. It is challenging to build a performance model for an IMC program because: 1) the performance of IMC programs is more sensitive to the size of the input dataset, which is known to be difficult to incorporate into a performance model due to its complex effects on performance; 2) the number of performance-critical configuration parameters in IMC is much larger than in ODC (more than 40 vs. around 10), and the high dimensionality requires more sophisticated models to achieve high accuracy. To address this challenge, we propose DAC, a datasize-aware auto-tuning approach to efficiently identify the high dimensional configuration for a given IMC program to achieve optimal performance on a given cluster. DAC is a significant advance over the state-of-the-art because it can take the size of the input dataset and 41 configuration parameters as the parameters of the performance model for a given IMC program, which is unprecedented in previous work. It is made possible by two key techniques: 1) Hierarchical Modeling (HM), which combines a number of individual sub-models in a hierarchical manner; 2) a Genetic Algorithm (GA), which is employed to search for the optimal configuration. To evaluate DAC, we use six typical Spark programs, each with five different input dataset sizes. The evaluation results show that DAC improves the performance of these programs over default configurations by a factor of 30.4x on average and up to 89x.
We also report that the geometric mean speedups of DAC over configurations by default, expert, and RFHOC are 15.4x, 2.3x, and 1.5x, respectively.
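The abstract reports both an average speedup over default configurations (30.4x; the abstract does not say which kind of mean this is) and a geometric mean speedup (15.4x). An arithmetic mean and a geometric mean can differ widely when a few programs see very large gains. The sketch below shows the difference on made-up speedup values; the numbers and function names are hypothetical and are not taken from the paper.

```python
import math


def arithmetic_mean(values):
    """Plain average: sum divided by count."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed as exp of the mean of the logs."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-program speedups; one large outlier dominates the average.
speedups = [2.0, 4.0, 8.0, 64.0]

print("arithmetic mean:", arithmetic_mean(speedups))  # 78 / 4 = 19.5
print("geometric mean:", geometric_mean(speedups))    # fourth root of 4096, about 8.0
```

The geometric mean is less sensitive to a single very large ratio, which is one common reason for reporting it when summarizing speedups across programs.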

Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing

Apache Hadoop MapReduce has been highly successful in processing large-scale, data-intensive batch applications on commodity clusters. However, for low-latency interactive applications and iterative computations, Apache Spark, an emerging in-memory processing framework, has been stealing the limelight. Recent studies have shown that current generation Big Data frameworks (like Hadoop) cannot efficiently leverage advanced features (e.g., RDMA) on modern clusters with high-performance networks. One of the major bottlenecks is that these middleware systems are traditionally written with sockets and do not deliver the best performance on modern HPC systems with RDMA-enabled high-performance interconnects. In this paper, we first assess the opportunities of bringing the benefits of RDMA into the Spark framework. We further propose a high-performance RDMA-based design for accelerating data shuffle in the Spark framework on high-performance networks. Performance evaluations show that our proposed design can achieve 79-83% performance improvement for GroupBy, compared with the default Spark running with IP over InfiniBand (IPoIB) FDR on a 128-256 core cluster. We adopt a plug-in-based approach that allows our design to be easily integrated with newer Spark releases. To the best of our knowledge, this is the first design for accelerating Spark with RDMA for Big Data processing.

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

HDFS (Hadoop Distributed File System) is the primary storage of Hadoop. Even though data locality offered by HDFS is important for Big Data applications, HDFS suffers from huge I/O bottlenecks due to the tri-replicated data blocks and cannot efficiently utilize the available storage devices in an HPC (High Performance Computing) cluster. Moreover, due to the limitation of local storage space, it is challenging to deploy HDFS in HPC environments. In this paper, we present a hybrid design (Triple-H) that can minimize the I/O bottlenecks in HDFS and ensure efficient utilization of the heterogeneous storage devices (e.g., RAM, SSD, and HDD) available on HPC clusters. We also propose effective data placement policies to speed up Triple-H. Our design, integrated with a parallel file system (e.g., Lustre), can lead to significant storage space savings and guarantee fault-tolerance. Performance evaluations show that Triple-H can improve the write and read throughputs of HDFS by up to 7x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 3x. Our design also improves the execution time of the Sort benchmark by up to 40% over default HDFS and 54% over Lustre. The alignment phase of the CloudBurst [20] application is accelerated by 19%. Triple-H also benefits the performance of SequenceCount and Grep in PUMA [15] over both default HDFS and Lustre.

Triple-H: a hybrid approach to accelerate HDFS on HPC clusters with heterogeneous storage architecture

The in-memory data processing framework, Apache Spark, has been stealing the limelight for low-latency interactive applications and for iterative and batch computations. Our early experience study [17] has shown that Apache Spark can be enhanced to leverage advanced features (e.g., RDMA) on high-performance networks (e.g., InfiniBand and RoCE) to improve the performance of the shuffle phase. With the rapid evolution of the Apache Spark ecosystem, the Spark architecture has been changing a lot. This motivates us to investigate whether the earlier RDMA design can be adapted and further enhanced for the new Apache Spark architecture. We also aim to improve the performance for various Spark workloads (e.g., Batch, Graph, SQL). In this paper, we present a detailed design for high-performance RDMA-based Apache Spark on high-performance networks. We conduct systematic performance evaluations on three modern clusters (Chameleon, SDSC Comet, and an in-house cluster) with cutting-edge InfiniBand technologies, such as the latest IB EDR (100 Gbps) network and the recently introduced Single Root I/O Virtualization (SR-IOV) technology for IB. The evaluation results show that compared to the default Spark running with IP over InfiniBand (IPoIB), our proposed design can achieve up to 79% performance improvement for Spark RDD operation benchmarks (e.g., GroupBy, SortBy), up to 38% performance improvement for batch workloads (e.g., Sort and TeraSort in Intel HiBench), up to 46% performance improvement for graph processing workloads (e.g., PageRank), and up to 32% performance improvement for SQL queries (e.g., Aggregation, Join) on varied scales (up to 1,536 cores) of bare-metal IB clusters. Performance evaluations on SR-IOV enabled IB clusters also show 37% improvement achieved by our RDMA-based design.
Our RDMA-based Spark design is implemented as a pluggable module and it does not change any Spark APIs, which means that it can be combined with other existing enhanced designs for Apache Spark and Hadoop proposed in the community. To show this, we further evaluate the performance of a combined version of ‘RDMA-Spark+RDMA-HDFS’ and the numbers show that the combination can achieve the best performance with up to 82% improvement for Intel HiBench Sort and TeraSort on SDSC Comet cluster.

High-performance design of Apache Spark with RDMA and its benefits on various workloads

Hadoop Distributed File System (HDFS) is the underlying storage engine of many Big Data processing frameworks such as Hadoop MapReduce, HBase, Hive, and Spark. Even though HDFS is well-known for its scalability and reliability, the requirement of a large amount of local storage space makes HDFS deployment challenging on HPC clusters. Moreover, HPC clusters usually have large installations of parallel file systems like Lustre. In this study, we propose a novel design to integrate HDFS with Lustre through a high performance key-value store. We design a burst buffer system using RDMA-based Memcached and present three schemes to integrate HDFS with Lustre through this buffer layer, considering different aspects of I/O, data-locality, and fault-tolerance. Our proposed schemes can ensure performance improvement for Big Data applications on HPC clusters. At the same time, they lead to reduced local storage requirements. Performance evaluations show that our design can improve the write performance of TestDFSIO by up to 2.6x over HDFS and 1.5x over Lustre. The gain in read throughput is up to 8x. Sort execution time is reduced by up to 28% over Lustre and 19% over HDFS. Our design can also significantly benefit I/O-intensive workloads compared to both HDFS and Lustre.

Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store

For data-intensive computing, the low throughput of the existing disk-bound storage systems is a major bottleneck. The recent emergence of in-memory file systems with heterogeneous storage support mitigates this problem to a great extent. Parallel programming frameworks, e.g., Hadoop MapReduce and Spark, are increasingly being run on such high-performance file systems. However, no comprehensive study has been done to analyze the impacts of the in-memory file systems on various Big Data applications. This paper characterizes two file systems in the literature, Tachyon [17] and Triple-H [13], that support in-memory and heterogeneous storage, and discusses the impacts of these two architectures on the performance and fault tolerance of Hadoop MapReduce and Spark applications. We present a complete methodology for evaluating MapReduce and Spark workloads on top of in-memory file systems and provide insights about the interactions of different system components while running these workloads. We also propose advanced acceleration techniques to adapt Triple-H for iterative applications and study the impact of different parameters on the performance of MapReduce and Spark jobs on HPC systems. Our evaluations show that, although Tachyon is 5x faster than HDFS for primitive operations, Triple-H performs 47% and 2.4x better than Tachyon for MapReduce and Spark workloads, respectively. Triple-H also accelerates K-Means by 15% over HDFS and 9% over Tachyon.

Dipti Shankar

Papers

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

Triple-H: a hybrid approach to accelerate HDFS on HPC clusters with heterogeneous storage architecture

High-performance design of Apache Spark with RDMA and its benefits on various workloads

Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store

Performance characterization and acceleration of in-memory file systems for Hadoop and Spark applications on HPC clusters