Topic

Spark (mathematics)

About: Spark (mathematics) is a research topic. Over its lifetime, 7,304 publications have been published within this topic, receiving 63,322 citations.


Papers
Journal Article
TL;DR: This paper proposes an adaptive performance model that can dynamically scale an Apache Spark Streaming platform up and down on AWS, a capability that existing models do not offer.

20 citations

Proceedings Article
14 May 2017
TL;DR: This paper proposes a new algorithm, called RF-GraP, which provides more efficient event correlation over distributed systems and achieves significant performance speedups with markedly less network communication.
Abstract: Event correlation is a cornerstone of process discovery over event logs that span multiple data sources. The computed correlation rules and process instances greatly help to unleash the power of process mining. However, exploring all possible event correlations over a log can be time-consuming, especially when the log is large. State-of-the-art methods based on MapReduce, designed to handle this challenge, have offered significant performance improvements over standalone implementations. However, all existing techniques are still based on a conventional generating-and-pruning scheme. Therefore, event partitioning across multiple machines is often inefficient. In this paper, following the principle of filtering-and-verification, we propose a new algorithm, called RF-GraP, which provides more efficient correlation over distributed systems. We present the detailed implementation of our approach and conduct a quantitative evaluation using the Spark platform. Experimental results demonstrate that the proposed method is indeed efficient. Compared to the state of the art, we are able to achieve significant performance speedups with markedly less network communication.

20 citations

Journal Article
Wen Xiao, Juan Hu
TL;DR: This paper proposes SWEclat, a distributed algorithm for mining frequent itemsets over massive streaming data. It is implemented on Apache Spark and stores the streaming data and the dataset in vertical data format as Spark RDDs, which are divided into partitions for distributed processing.
Abstract: Finding frequent itemsets in continuous streaming data is an important data mining task that is widely used in network monitoring, Internet of Things data analysis and so on. In the era of big data, it is necessary to develop a distributed frequent itemset mining algorithm to meet the needs of massive streaming data processing. Apache Spark is a unified analytics engine for massive data processing that has been used successfully in many data mining fields. In this paper, we propose a distributed algorithm for mining frequent itemsets over massive streaming data, named SWEclat. The algorithm uses a sliding window to process streaming data and a vertical data structure to store the dataset in the sliding window. It is implemented on Apache Spark and uses Spark RDDs to store the streaming data and the dataset in vertical data format, so that these RDDs can be divided into partitions for distributed processing. Experimental results show that the SWEclat algorithm has good acceleration, parallel scalability and load balancing.

20 citations
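
The sliding-window, vertical-format idea described in the SWEclat abstract above can be illustrated with a small single-machine sketch. This is not the authors' implementation and does not use Spark: the class name `SlidingWindowEclat`, its parameters, and the count-based window are assumptions made for illustration. It keeps an item-to-transaction-id map (the vertical format) for the current window and mines itemsets by intersecting those id sets, as in the classic Eclat approach.

```python
from collections import deque


class SlidingWindowEclat:
    """Toy sketch: a count-based sliding window stored in vertical format."""

    def __init__(self, window_size, min_support):
        self.window_size = window_size  # max transactions kept in the window
        self.min_support = min_support  # absolute support threshold
        self.window = deque()           # (tid, items) pairs, oldest first
        self.tidsets = {}               # vertical format: item -> set of tids
        self.next_tid = 0

    def add(self, transaction):
        """Append a transaction; evict the oldest one if the window overflows."""
        tid = self.next_tid
        self.next_tid += 1
        items = frozenset(transaction)
        self.window.append((tid, items))
        for item in items:
            self.tidsets.setdefault(item, set()).add(tid)
        if len(self.window) > self.window_size:
            old_tid, old_items = self.window.popleft()
            for item in old_items:
                self.tidsets[item].discard(old_tid)
                if not self.tidsets[item]:
                    del self.tidsets[item]

    def frequent_itemsets(self):
        """Depth-first Eclat search: support = size of intersected tidsets."""
        result = {}
        frequent = sorted(
            ((item, tids) for item, tids in self.tidsets.items()
             if len(tids) >= self.min_support),
            key=lambda pair: pair[0],
        )

        def extend(prefix, prefix_tids, candidates):
            for i, (item, tids) in enumerate(candidates):
                new_tids = prefix_tids & tids if prefix else tids
                if len(new_tids) >= self.min_support:
                    itemset = prefix + (item,)
                    result[itemset] = len(new_tids)
                    extend(itemset, new_tids, candidates[i + 1:])

        extend((), set(), frequent)
        return result


miner = SlidingWindowEclat(window_size=3, min_support=2)
for transaction in (["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]):
    miner.add(transaction)
itemsets = miner.frequent_itemsets()
```

In a distributed setting the per-item id sets would be the natural unit to partition, which is roughly what the abstract describes doing with RDDs; the sketch leaves that part out.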

Monograph
08 Aug 2011

19 citations

Posted Content
TL;DR: Cuttlefish is a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques.
Abstract: Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefish, a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques. We prototype Cuttlefish in Apache Spark and adaptively choose operators for image convolution, regular expression matching, and relational joins. Our experiments show Cuttlefish-based adaptive convolution and regular expression operators can reach 72-99% of the throughput of an all-knowing oracle that always selects the optimal algorithm, even when individual physical operators are up to 105x slower than the optimal. Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x compared with Spark SQL's query optimizer.

19 citations
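
The explore-and-exploit loop described in the Cuttlefish abstract above can be sketched with a very simple bandit. The abstract only says "multi-armed bandit reinforcement learning techniques", so the epsilon-greedy strategy used here is a stand-in chosen for brevity, not the paper's method. The class `OperatorTuner`, the operator names, and the simulated costs are all hypothetical; a real system would time actual operator executions instead.

```python
import random


class OperatorTuner:
    """Epsilon-greedy bandit over candidate physical operators (illustrative)."""

    def __init__(self, operators, epsilon=0.1, seed=0):
        self.operators = list(operators)   # arm labels
        self.epsilon = epsilon             # probability of exploring
        self.rng = random.Random(seed)     # seeded for reproducible runs
        self.counts = {op: 0 for op in self.operators}
        self.mean_cost = {op: 0.0 for op in self.operators}

    def choose(self):
        # Try every operator once before trusting the running means.
        for op in self.operators:
            if self.counts[op] == 0:
                return op
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.operators)  # explore
        return self.best()                          # exploit

    def observe(self, op, cost):
        """Fold one observed cost into the operator's running mean."""
        self.counts[op] += 1
        self.mean_cost[op] += (cost - self.mean_cost[op]) / self.counts[op]

    def best(self):
        """Operator with the lowest mean observed cost so far."""
        return min(self.operators, key=lambda op: self.mean_cost[op])


# Simulated per-batch costs stand in for timing real operators (assumed values).
simulated_cost = {"nested_loop": 5.0, "hash": 1.0, "sort_merge": 3.0}
tuner = OperatorTuner(simulated_cost)
for _ in range(500):
    op = tuner.choose()
    tuner.observe(op, simulated_cost[op])
```

After the loop, `tuner.best()` returns the cheapest operator and `tuner.counts` shows that most batches were routed to it, with a small share spent on exploration.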


Network Information
Related Topics (5)
Software: 130.5K papers, 2M citations, 76% related
Combustion: 172.3K papers, 1.9M citations, 72% related
Cluster analysis: 146.5K papers, 2.9M citations, 72% related
Cloud computing: 156.4K papers, 1.9M citations, 71% related
Hydrogen: 132.2K papers, 2.5M citations, 69% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2022    10
2021    429
2020    525
2019    661
2018    758
2017    683