Topic
Spark (mathematics)
About: Spark (mathematics) is a research topic. Over its lifetime, 7,304 publications have been published within this topic, receiving 63,322 citations.
Papers
••
TL;DR: This paper proposes an adaptive performance model that can dynamically scale the Apache Spark Streaming platform up and down on AWS, a capability that existing models do not offer.
20 citations
••
14 May 2017
TL;DR: This paper proposes a new algorithm, called RF-GraP, which provides a more efficient correlation over distributed systems and is able to achieve significant performance speedups with obviously less network communication.
Abstract: Event correlation is a cornerstone for process discovery over event logs crossing multiple data sources. The computed correlation rules and process instances will greatly help us to unleash the power of process mining. However, exploring all possible event correlations over a log could be time consuming, especially when the log is large. State-of-the-art methods based on MapReduce designed to handle this challenge have offered significant performance improvements over standalone implementations. However, all existing techniques are still based on a conventional generating-and-pruning scheme. Therefore, event partitioning across multiple machines is often inefficient. In this paper, following the principle of filtering-and-verification, we propose a new algorithm, called RF-GraP, which provides a more efficient correlation over distributed systems. We present the detailed implementation of our approach and conduct a quantitative evaluation using the Spark platform. Experimental results demonstrate that the proposed method is indeed efficient. Compared to the state-of-the-art, we are able to achieve significant performance speedups with obviously less network communication.
20 citations
••
TL;DR: A distributed algorithm named SWEclat for mining frequent itemsets over massive streaming data is proposed and implemented on Apache Spark; it stores the streaming data and the dataset in Spark RDDs in vertical data format and divides these RDDs into partitions for distributed processing.
Abstract: Finding frequent itemsets in continuous streaming data is an important data mining task that is widely used in network monitoring, Internet of Things data analysis, and so on. In the era of big data, it is necessary to develop a distributed frequent itemset mining algorithm to meet the needs of massive streaming data processing. Apache Spark is a unified analytics engine for massive data processing that has been successfully used in many data mining fields. In this paper, we propose a distributed algorithm for mining frequent itemsets over massive streaming data named SWEclat. The algorithm uses a sliding window to process streaming data and a vertical data structure to store the dataset in the sliding window. The algorithm is implemented on Apache Spark and uses Spark RDDs to store the streaming data and the dataset in vertical data format, so that these RDDs can be divided into partitions for distributed processing. Experimental results show that the SWEclat algorithm has good acceleration, parallel scalability, and load balancing.
20 citations
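The SWEclat abstract above combines a sliding window with a vertical data format, where each item maps to the set of transaction IDs that contain it. The sketch below illustrates that general idea on a single machine in plain Python. It is not the authors' implementation: the class name `SlidingWindowEclat`, its parameters, and the count-based window are assumptions made for illustration, and no Spark RDD partitioning is attempted.

```python
from collections import deque


class SlidingWindowEclat:
    """Hypothetical single-machine sketch: Eclat-style mining over a sliding window."""

    def __init__(self, window_size, min_support):
        self.window_size = window_size  # number of most recent transactions kept
        self.min_support = min_support  # absolute support threshold
        self.window = deque()           # (tid, items) pairs in arrival order
        self.tidsets = {}               # vertical format: item -> set of tids
        self.next_tid = 0

    def add(self, transaction):
        """Append one transaction and evict the oldest if the window is full."""
        tid = self.next_tid
        self.next_tid += 1
        items = frozenset(transaction)
        self.window.append((tid, items))
        for item in items:
            self.tidsets.setdefault(item, set()).add(tid)
        if len(self.window) > self.window_size:
            old_tid, old_items = self.window.popleft()
            for item in old_items:
                tids = self.tidsets[item]
                tids.discard(old_tid)
                if not tids:
                    del self.tidsets[item]

    def frequent_itemsets(self):
        """Return {itemset tuple: support} for the current window."""
        result = {}
        frequent = sorted(
            (item, tids)
            for item, tids in self.tidsets.items()
            if len(tids) >= self.min_support
        )
        self._extend((), frequent, result)
        return result

    def _extend(self, prefix, candidates, result):
        # Depth-first search: support of a larger itemset is the size of the
        # intersection of the tidsets of its members.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            result[itemset] = len(tids)
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= self.min_support:
                    next_candidates.append((other, shared))
            if next_candidates:
                self._extend(itemset, next_candidates, result)


miner = SlidingWindowEclat(window_size=3, min_support=2)
for tx in (["a", "b"], ["a", "b", "c"], ["a", "c"], ["b", "c"]):
    miner.add(tx)
print(miner.frequent_itemsets())
```

In a distributed setting, the item-to-tidset map is the natural unit to partition, which is presumably why the abstract stresses the vertical format; how SWEclat actually partitions its RDDs is not described in the text above.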
•
TL;DR: Cuttlefish is a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques.
Abstract: Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefish, a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques. We prototype Cuttlefish in Apache Spark and adaptively choose operators for image convolution, regular expression matching, and relational joins. Our experiments show Cuttlefish-based adaptive convolution and regular expression operators can reach 72-99% of the throughput of an all-knowing oracle that always selects the optimal algorithm, even when individual physical operators are up to 105x slower than the optimal. Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x compared with Spark SQL's query optimizer.
19 citations
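The Cuttlefish abstract above describes exploring candidate physical operators during execution and exploiting the fastest ones with multi-armed bandit techniques. The sketch below shows that explore/exploit loop with a simple epsilon-greedy bandit, chosen here only because it is short; the abstract does not say which bandit policy Cuttlefish uses. The class `EpsilonGreedyOperatorChooser`, its methods, and the two toy operators are hypothetical names introduced for illustration.

```python
import random
import time


class EpsilonGreedyOperatorChooser:
    """Hypothetical sketch: pick among equivalent operators by observed reward."""

    def __init__(self, operators, epsilon=0.1, seed=0):
        self.operators = list(operators)
        self.epsilon = epsilon                      # probability of exploring
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.operators)     # times each operator ran
        self.mean_reward = [0.0] * len(self.operators)

    def choose(self):
        """Return the index of the operator to run next."""
        # Try every operator once before trusting the averages.
        for index, count in enumerate(self.counts):
            if count == 0:
                return index
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.operators))
        return max(range(len(self.operators)), key=lambda i: self.mean_reward[i])

    def observe(self, index, reward):
        """Update the running mean reward of one operator."""
        self.counts[index] += 1
        n = self.counts[index]
        self.mean_reward[index] += (reward - self.mean_reward[index]) / n

    def process(self, batch):
        """Run one batch through the chosen operator; reward is negative runtime."""
        index = self.choose()
        start = time.perf_counter()
        output = self.operators[index](batch)
        self.observe(index, -(time.perf_counter() - start))
        return output


# Two toy operators that compute the same result in different ways.
def sum_squares_loop(values):
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_builtin(values):
    return sum(v * v for v in values)


chooser = EpsilonGreedyOperatorChooser([sum_squares_loop, sum_squares_builtin])
for _ in range(100):
    chooser.process(list(range(1000)))
print(chooser.counts)
```

Because every candidate operator produces the same output, the choice affects only throughput, which is what makes per-batch exploration safe in this setting.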