Topic

Spark (mathematics)

About: Spark (mathematics) is a research topic. Over its lifetime, 7,304 publications have been published within this topic, receiving 63,322 citations.


Papers
Proceedings ArticleDOI
27 Jun 2015
TL;DR: By combining Weka's usability and Spark's processing power, DistributedWekaSpark provides a usable prototype distributed Big Data Mining workbench that achieves near-linear scaling in executing various real-world scale workloads - 91.4% weak scaling efficiency on average and up to 4x faster on average than Hadoop.
Abstract: Effective Big Data Mining requires scalable and efficient solutions that are also accessible to users of all levels of expertise. Despite this, many current efforts to provide effective knowledge extraction via large-scale Big Data Mining tools focus more on performance than on use and tuning which are complex problems even for experts. Weka is a popular and comprehensive Data Mining workbench with a well-known and intuitive interface, nonetheless it supports only sequential single-node execution. Hence, the size of the datasets and processing tasks that Weka can handle within its existing environment is limited both by the amount of memory in a single node and by sequential execution. This work discusses Distributed Weka Spark, a distributed framework for Weka which maintains its existing user interface. The framework is implemented on top of Spark, a Hadoop-related distributed framework with fast in-memory processing capabilities and support for iterative computations. By combining Weka's usability and Spark's processing power, Distributed Weka Spark provides a usable prototype distributed Big Data Mining workbench that achieves near-linear scaling in executing various real-world scale workloads - 91.4% weak scaling efficiency on average and up to 4x faster on average than Hadoop.

54 citations
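
The abstract above reports weak scaling efficiency without defining it. Under the common definition (an assumption here, since the abstract does not state which one the authors used), the workload grows in proportion to the node count, and efficiency is the ratio of the single-node runtime to the N-node runtime. A minimal sketch, with hypothetical timings that are not taken from the paper:

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Return T(1 node, 1 unit of work) / T(N nodes, N units of work).

    A value of 1.0 means perfect weak scaling: runtime stays flat as
    nodes and data grow together.
    """
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("timings must be positive")
    return t_base / t_scaled


# Hypothetical timings in seconds (illustrative only, not from the paper).
efficiency = weak_scaling_efficiency(t_base=100.0, t_scaled=109.4)
print(f"weak scaling efficiency: {efficiency:.1%}")
```

With these made-up timings, a runtime increase of about 9.4% when nodes and data scale together corresponds to an efficiency of roughly 91%.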

Proceedings ArticleDOI
15 Dec 2014
TL;DR: This work studies the effectiveness of conventional general-purpose processors on Spark workloads, comparing their behavior with that on Hadoop, CloudSuite, SPEC CPU2006, TPC-C, and DesktopCloud, and evaluates the benchmarks on a 17-node Xeon cluster.
Abstract: © 2014 IEEE. The increasing demands of big data applications have led researchers and practitioners to turn to in-memory computing to speed processing. For instance, the Apache Spark framework stores intermediate results in memory to deliver good performance on iterative machine learning and interactive data analysis tasks. To the best of our knowledge, though, little work has been done to understand Spark's architectural and microarchitectural behaviors. Furthermore, although conventional commodity processors have been well optimized for traditional desktops and HPC, their effectiveness for Spark workloads remains to be studied. To shed some light on the effectiveness of conventional general-purpose processors on Spark workloads, we study their behavior in comparison to those of Hadoop, CloudSuite, SPEC CPU2006, TPC-C, and DesktopCloud. We evaluate the benchmarks on a 17-node Xeon cluster. Our performance results reveal that Spark workloads have significantly different characteristics from Hadoop and traditional HPC benchmarks. At the system level, Spark workloads have good memory bandwidth utilization (up to 50%), stable memory accesses, and high disk IO request frequency (200 per second). At the microarchitectural level, the cache and TLB are effective for Spark workloads, but the L2 cache miss rate is high. We hope this work yields insights for chip and datacenter system designers.

54 citations

01 Jan 2002
TL;DR: In this paper, the authors quantified the losses and gains of different engine control strategies for turbocharged engines and analyzed the trade-off between fuel economy and transient performance in turbocharged engines.
Abstract: The subject of this study is the trade-off between fuel economy and transient performance in turbocharged engines. It quantifies the losses and gains of different engine control strategies. Two ext ...

54 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: The RDMA-based Spark design is implemented as a pluggable module and it does not change any Spark APIs, which means that it can be combined with other existing enhanced designs for Apache Spark and Hadoop proposed in the community.
Abstract: The in-memory data processing framework, Apache Spark, has been stealing the limelight for low-latency interactive applications, iterative and batch computations. Our early experience study [17] has shown that Apache Spark can be enhanced to leverage advanced features (e.g., RDMA) on high-performance networks (e.g., InfiniBand and RoCE) to improve the performance of the shuffle phase. With the rapid evolution of the Apache Spark ecosystem, the Spark architecture has been changing a lot. This motivates us to investigate whether the earlier RDMA design can be adapted and further enhanced for the new Apache Spark architecture. We also aim to improve the performance for various Spark workloads (e.g., Batch, Graph, SQL). In this paper, we present a detailed design for high-performance RDMA-based Apache Spark on high-performance networks. We conduct systematic performance evaluations on three modern clusters (Chameleon, SDSC Comet, and an in-house cluster) with cutting-edge InfiniBand technologies, such as the latest IB EDR (100 Gbps) network, the recently introduced Single Root I/O Virtualization (SR-IOV) technology for IB, etc. The evaluation results show that, compared to the default Spark running with IP over InfiniBand (IPoIB), our proposed design can achieve up to 79% performance improvement for Spark RDD operation benchmarks (e.g., GroupBy, SortBy), up to 38% performance improvement for batch workloads (e.g., Sort and TeraSort in Intel HiBench), up to 46% performance improvement for graph processing workloads (e.g., PageRank), and up to 32% performance improvement for SQL queries (e.g., Aggregation, Join) on varied scales (up to 1,536 cores) of bare-metal IB clusters. Performance evaluations on SR-IOV enabled IB clusters also show 37% improvement achieved by our RDMA-based design.
Our RDMA-based Spark design is implemented as a pluggable module and it does not change any Spark APIs, which means that it can be combined with other existing enhanced designs for Apache Spark and Hadoop proposed in the community. To show this, we further evaluate the performance of a combined version of ‘RDMA-Spark+RDMA-HDFS’ and the numbers show that the combination can achieve the best performance with up to 82% improvement for Intel HiBench Sort and TeraSort on SDSC Comet cluster.

54 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: Analysis of the results has shown that replacing Hadoop with Spark or Flink can lead to a reduction in execution times by 77% and 70% on average, respectively, for non-sort benchmarks.
Abstract: The increasing adoption of Big Data analytics has led to a high demand for efficient technologies in order to manage and process large datasets. Popular MapReduce frameworks such as Hadoop are being replaced by emerging ones like Spark or Flink, which improve both the programming APIs and performance. However, few works have focused on comparing these frameworks. This paper addresses this issue by performing a comparative evaluation of Hadoop, Spark and Flink using representative Big Data workloads and considering factors like performance and scalability. Moreover, the behavior of these frameworks has been characterized by modifying some of the main parameters of the workloads such as HDFS block size, input data size, interconnect network or thread configuration. The analysis of the results has shown that replacing Hadoop with Spark or Flink can lead to a reduction in execution times by 77% and 70% on average, respectively, for non-sort benchmarks.

54 citations
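
The execution-time reductions quoted in the abstract above can be restated as speedup factors, which are easier to compare with the "x times faster" figures used by other entries on this page. A minimal sketch of that arithmetic follows. The function name is hypothetical, and applying the conversion to an averaged reduction only approximates the average speedup, since the abstract does not give per-benchmark numbers.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional execution-time reduction into a speedup factor.

    A reduction r means new_time = (1 - r) * old_time,
    so speedup = old_time / new_time = 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# Average reductions reported in the abstract for non-sort benchmarks.
for framework, reduction in [("Spark", 0.77), ("Flink", 0.70)]:
    factor = speedup_from_reduction(reduction)
    print(f"{framework}: {reduction:.0%} less time is about {factor:.2f}x faster")
```

Under this conversion, a 77% reduction corresponds to a speedup of roughly 4.3x and a 70% reduction to roughly 3.3x.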


Network Information
Related Topics (5)
Software: 130.5K papers, 2M citations, 76% related
Combustion: 172.3K papers, 1.9M citations, 72% related
Cluster analysis: 146.5K papers, 2.9M citations, 72% related
Cloud computing: 156.4K papers, 1.9M citations, 71% related
Hydrogen: 132.2K papers, 2.5M citations, 69% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2022    10
2021    429
2020    525
2019    661
2018    758
2017    683