scispace - formally typeset

Spark (mathematics)

About: Spark (mathematics) is a research topic. Over the lifetime, 7304 publications have been published within this topic receiving 63322 citations.


Papers
Journal ArticleDOI
TL;DR: Spark has better performance as compared to Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
Abstract: Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for industry. Distributed computing frameworks such as Hadoop and Spark offer efficient solutions for analyzing vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both frameworks expose more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default parameters let system administrators deploy applications without much effort and measure cluster performance with factory-set values. However, an open question remains: can a new parameter selection improve cluster performance for large datasets? In this regard, this study investigates the most impactful parameters, covering resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark on a cluster implemented in our laboratory. We tuned these parameters with a trial-and-error approach based on a large number of experiments. For the comparative analysis, we selected two workloads: WordCount and TeraSort. The performance metrics are based on three criteria: execution time, throughput, and speedup. Our experimental results reveal that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis of the results shows that Spark outperforms Hadoop when datasets are small, achieving up to a two-fold speedup in WordCount workloads and up to 14-fold in TeraSort workloads when default parameter values are reconfigured.
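The paper tunes parameters in three groups (resource utilization, input splits, shuffle). As a sketch of what reconfiguring such defaults looks like in practice, the following spark-submit invocation overrides a few real Spark properties from each group; the specific values and the application name are illustrative, not the paper's tuned settings:

```shell
# Illustrative values only; optimal settings depend on the cluster and workload.
# spark.executor.memory / spark.executor.cores : resource utilization
# spark.default.parallelism                    : task/input-split parallelism
# spark.shuffle.compress                       : shuffle behavior
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  --conf spark.default.parallelism=64 \
  --conf spark.shuffle.compress=true \
  wordcount.py hdfs:///data/input
```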

32 citations

Journal ArticleDOI
TL;DR: This paper proposes a novel selection algorithm by which Spark can automatically select the RDDs whose partitions should be cached in memory, according to how many times each RDD is used, and puts forward a novel replacement algorithm called the weight replacement (WR) algorithm, which takes comprehensive account of each partition's computation cost, number of uses, and size.
Abstract: As a parallel computation framework, Spark can cache Resilient Distributed Dataset (RDD) partitions across different nodes to speed up computation. However, Spark lacks a good mechanism for selecting which RDDs to cache in limited memory. In this paper, we propose a novel selection algorithm by which Spark automatically selects the RDDs whose partitions should be cached in memory, according to how many times each RDD is used. Our selection algorithm speeds up iterative computations. Nevertheless, when many new RDDs are chosen for caching while the limited memory is already full, the system falls back on the Least Recently Used (LRU) replacement algorithm. The LRU algorithm, however, considers only whether partitions were recently used, ignoring other factors such as their computation cost. We therefore put forward a novel replacement algorithm, the Weight Replacement (WR) algorithm, which takes comprehensive account of each partition's computation cost, number of uses, and size. Experimental results show that Spark with our selection algorithm computes faster than without it, and that Spark with the WR algorithm shows better performance still. Copyright © 2013 John Wiley & Sons, Ltd.
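The abstract names the three factors WR weighs (computation cost, number of uses, partition size) but not the exact formula. A minimal plain-Python sketch of a weight-based eviction pass, assuming a hypothetical weight of cost × uses / size (cheap, rarely used, large partitions are evicted first):

```python
def wr_evict(cached, bytes_needed):
    """Evict cached partitions in ascending weight order until enough
    space is freed, and return the evicted partition ids.

    Each partition is a dict with 'id', 'cost' (recomputation cost),
    'uses' (how many times it is referenced), and 'size' (bytes).
    The weight formula cost * uses / size is an assumption for
    illustration; the paper only states which factors WR considers.
    """
    ranked = sorted(cached, key=lambda p: p["cost"] * p["uses"] / p["size"])
    freed, victims = 0, []
    for p in ranked:
        if freed >= bytes_needed:
            break
        victims.append(p["id"])
        freed += p["size"]
    return victims


partitions = [
    {"id": "a", "cost": 10, "uses": 1, "size": 100},  # weight 0.1
    {"id": "b", "cost": 50, "uses": 4, "size": 100},  # weight 2.0
    {"id": "c", "cost": 5,  "uses": 2, "size": 50},   # weight 0.2
]
print(wr_evict(partitions, 120))  # ['a', 'c']
```

Unlike LRU, this ranking keeps expensive-to-recompute, frequently reused partitions in memory even if they were not touched recently.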

32 citations

Proceedings ArticleDOI
TL;DR: In this article, a closed-form model of job execution on Apache Spark, a popular parallel processing engine, is presented, which can be used to estimate the completion time of a given Spark job on a cloud, with respect to the size of the input dataset, the number of iterations, etc.
Abstract: We present OptEx, a closed-form model of job execution on Apache Spark, a popular parallel processing engine. To the best of our knowledge, OptEx is the first work to analytically model job completion time on Spark. The model can be used to estimate the completion time of a given Spark job on a cloud with respect to the size of the input dataset, the number of iterations, and the number of nodes in the underlying cluster. Experimental results demonstrate that OptEx yields a mean relative error of 6% in estimating job completion time. Furthermore, the model can be applied to estimate the cost-optimal cluster composition for running a given Spark job on a cloud under a completion deadline specified in the SLO (i.e., Service Level Objective). We show experimentally that OptEx correctly estimates the cost-optimal cluster composition for running a given Spark job under an SLO deadline with an accuracy of 98%.
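The abstract does not reproduce OptEx's actual equations, but the shape of such a closed-form model can be sketched: a fixed setup cost plus per-iteration work that grows with data size and divides across nodes. All coefficients below are made up for illustration and are not the paper's fitted parameters:

```python
def estimate_job_time(data_gb, iterations, nodes,
                      t_setup=30.0, t_per_gb_per_node=12.0):
    """Hypothetical closed-form estimate of Spark job completion time
    in seconds: fixed setup cost plus iteration work that scales with
    input size and is shared evenly across cluster nodes. This is a
    sketch of the model's general shape, not OptEx's actual formula.
    """
    return t_setup + iterations * data_gb * t_per_gb_per_node / nodes


# A 100 GB job with 10 iterations on 20 nodes under these toy coefficients:
print(estimate_job_time(100, 10, 20))  # 630.0
```

Given such a formula, the cost-optimal cluster size under an SLO deadline can be found by scanning node counts and taking the cheapest one whose estimated time meets the deadline.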

32 citations

Journal ArticleDOI
TL;DR: In this article, Allibone and Meek used a rotating camera to photograph the long electric spark; preliminary results were communicated earlier, and this paper extends the results given in the two preceding papers.
Abstract: The development of the short electric spark has been extensively studied by various scientists using the Kerr electro-optical shutter, which enables the spark to be photographed at short intervals after its initiation. The longest sparks so studied have been of the order of a few centimetres, and the gap has usually been between parallel plate electrodes. The spark has been shown to start sometimes at the electrodes and sometimes in the mid-gap region. The cloud chamber has also been used by others to study the pre-discharge and the spark between dissimilar electrodes and between parallel plate electrodes: between dissimilar electrodes the discharge generally starts from the smaller electrode, while between parallel plate electrodes the same results were obtained as with the Kerr cell technique. The rotating camera has been used by the authors to photograph the long electric spark, and preliminary results have already been communicated (Allibone and Meek 1937; Allibone 1938). The present paper extends the results given in the two preceding papers.

32 citations


Network Information
Related Topics (5)
Software
130.5K papers, 2M citations
76% related
Combustion
172.3K papers, 1.9M citations
72% related
Cluster analysis
146.5K papers, 2.9M citations
72% related
Cloud computing
156.4K papers, 1.9M citations
71% related
Hydrogen
132.2K papers, 2.5M citations
69% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2022    10
2021    429
2020    525
2019    661
2018    758
2017    683