scispace - formally typeset

Spark (mathematics)

About: Spark (mathematics) is a research topic. Over the lifetime, 7304 publications have been published within this topic receiving 63322 citations.


Papers
Journal ArticleDOI
TL;DR: Spark has better performance as compared to Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
Abstract: Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for industry. Distributed computing frameworks such as Hadoop and Spark offer efficient solutions for analyzing vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both frameworks expose more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default parameters let system administrators deploy applications without much effort and measure cluster performance with factory-set values. However, an open question remains: can a new parameter selection improve cluster performance for large datasets? In this regard, this study investigates the most impactful parameters, covering resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark on a cluster implemented in our laboratory. We tuned these parameters with a trial-and-error approach based on a large number of experiments. For the comparative analysis, we selected two workloads: WordCount and TeraSort. The performance metrics are based on three criteria: execution time, throughput, and speedup. Our experimental results reveal that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis of the results shows that Spark outperforms Hadoop when datasets are small, achieving up to a two-fold speedup in WordCount workloads and up to 14-fold in TeraSort workloads when default parameter values are reconfigured.
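The paper tunes parameters in three groups (resource utilization, input splits, shuffle). As a sketch of what reconfiguring such defaults looks like in practice, the following spark-submit invocation overrides a few real Spark properties from each group; the specific values and the application name are illustrative, not the paper's tuned settings:

```shell
# Illustrative values only; optimal settings depend on the cluster and workload.
# spark.executor.memory / spark.executor.cores : resource utilization
# spark.default.parallelism                    : task/input-split parallelism
# spark.shuffle.compress                       : shuffle behavior
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  --conf spark.default.parallelism=64 \
  --conf spark.shuffle.compress=true \
  wordcount.py hdfs:///data/input
```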

32 citations

Journal ArticleDOI
TL;DR: This paper proposes a novel selection algorithm by which Spark can automatically select the RDDs whose partitions should be cached in memory, according to how many times each RDD is used, and puts forward a novel replacement algorithm called the weight replacement (WR) algorithm, which takes comprehensive account of each partition's computation cost, number of uses, and size.
Abstract: As a parallel computation framework, Spark can cache Resilient Distributed Dataset (RDD) partitions across different nodes to speed up computation. However, Spark lacks a good mechanism for selecting which RDDs to cache in limited memory. In this paper, we propose a novel selection algorithm by which Spark automatically selects the RDDs whose partitions should be cached in memory, according to how many times each RDD is used. Our selection algorithm speeds up iterative computations. Nevertheless, when many new RDDs are chosen for caching while the limited memory is already full, the system falls back on the Least Recently Used (LRU) replacement algorithm. The LRU algorithm, however, considers only whether partitions were recently used, ignoring other factors such as their computation cost. We therefore put forward a novel replacement algorithm, the Weight Replacement (WR) algorithm, which takes comprehensive account of each partition's computation cost, number of uses, and size. Experimental results show that Spark with our selection algorithm computes faster than without it, and that Spark with the WR algorithm shows better performance still. Copyright © 2013 John Wiley & Sons, Ltd.
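The abstract names the three factors WR weighs (computation cost, number of uses, partition size) but not the exact formula. A minimal plain-Python sketch of a weight-based eviction pass, assuming a hypothetical weight of cost × uses / size (cheap, rarely used, large partitions are evicted first):

```python
def wr_evict(cached, bytes_needed):
    """Evict cached partitions in ascending weight order until enough
    space is freed, and return the evicted partition ids.

    Each partition is a dict with 'id', 'cost' (recomputation cost),
    'uses' (how many times it is referenced), and 'size' (bytes).
    The weight formula cost * uses / size is an assumption for
    illustration; the paper only states which factors WR considers.
    """
    ranked = sorted(cached, key=lambda p: p["cost"] * p["uses"] / p["size"])
    freed, victims = 0, []
    for p in ranked:
        if freed >= bytes_needed:
            break
        victims.append(p["id"])
        freed += p["size"]
    return victims


partitions = [
    {"id": "a", "cost": 10, "uses": 1, "size": 100},  # weight 0.1
    {"id": "b", "cost": 50, "uses": 4, "size": 100},  # weight 2.0
    {"id": "c", "cost": 5,  "uses": 2, "size": 50},   # weight 0.2
]
print(wr_evict(partitions, 120))  # ['a', 'c']
```

Unlike LRU, this ranking keeps expensive-to-recompute, frequently reused partitions in memory even if they were not touched recently.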

32 citations

Proceedings ArticleDOI
TL;DR: In this article, a closed-form model of job execution on Apache Spark, a popular parallel processing engine, is presented, which can be used to estimate the completion time of a given Spark job on a cloud, with respect to the size of the input dataset, the number of iterations, etc.
Abstract: We present OptEx, a closed-form model of job execution on Apache Spark, a popular parallel processing engine. To the best of our knowledge, OptEx is the first work to analytically model job completion time on Spark. The model can be used to estimate the completion time of a given Spark job on a cloud with respect to the size of the input dataset, the number of iterations, and the number of nodes in the underlying cluster. Experimental results demonstrate that OptEx yields a mean relative error of 6% in estimating job completion time. Furthermore, the model can be applied to estimate the cost-optimal cluster composition for running a given Spark job on a cloud under a completion deadline specified in the SLO (i.e., Service Level Objective). We show experimentally that OptEx correctly estimates the cost-optimal cluster composition for running a given Spark job under an SLO deadline with an accuracy of 98%.
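The abstract does not reproduce OptEx's actual equations, but the shape of such a closed-form model can be sketched: a fixed setup cost plus per-iteration work that grows with data size and divides across nodes. All coefficients below are made up for illustration and are not the paper's fitted parameters:

```python
def estimate_job_time(data_gb, iterations, nodes,
                      t_setup=30.0, t_per_gb_per_node=12.0):
    """Hypothetical closed-form estimate of Spark job completion time
    in seconds: fixed setup cost plus iteration work that scales with
    input size and is shared evenly across cluster nodes. This is a
    sketch of the model's general shape, not OptEx's actual formula.
    """
    return t_setup + iterations * data_gb * t_per_gb_per_node / nodes


# A 100 GB job with 10 iterations on 20 nodes under these toy coefficients:
print(estimate_job_time(100, 10, 20))  # 630.0
```

Given such a formula, the cost-optimal cluster size under an SLO deadline can be found by scanning node counts and taking the cheapest one whose estimated time meets the deadline.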

32 citations

Journal ArticleDOI
TL;DR: In this article, Allibone and Meek used a rotating camera to photograph the long electric spark; preliminary results were communicated earlier, and this paper extends the results given in the two preceding papers.
Abstract: The development of the short electric spark has been extensively studied by various scientists using the Kerr electro-optical shutter, which enables the spark to be photographed at short intervals after its initiation. The longest sparks so studied have been of the order of a few centimetres, and the gap has usually been between parallel plate electrodes. The spark has been shown to start sometimes at the electrodes and sometimes in the mid-gap region. The cloud chamber has also been used by others to study the pre-discharge and the spark between dissimilar electrodes and between parallel plate electrodes: between dissimilar electrodes the discharge generally starts from the smaller electrode, while between parallel plate electrodes the same results were obtained as with the Kerr cell technique. The rotating camera has been used by the authors to photograph the long electric spark, and preliminary results have already been communicated (Allibone and Meek 1937; Allibone 1938). The present paper extends the results given in the two preceding papers.

32 citations


Network Information
Related Topics (5)
Software
130.5K papers, 2M citations
76% related
Combustion
172.3K papers, 1.9M citations
72% related
Cluster analysis
146.5K papers, 2.9M citations
72% related
Cloud computing
156.4K papers, 1.9M citations
71% related
Hydrogen
132.2K papers, 2.5M citations
69% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2022    10
2021    429
2020    525
2019    661
2018    758
2017    683