Topic

Spark (mathematics)

About: Spark (mathematics) is a research topic. Over its lifetime, 7,304 publications have been published within this topic, receiving 63,322 citations.

Papers published on a yearly basis (chart covering 1968 to 2022)

Papers

Journal Article•DOI•

Adaptive performance model for dynamic scaling Apache Spark Streaming

Max Petrov¹, Nikolay Butakov¹, Denis A. Nasonov¹, Mikhail Melnik¹•Institutions (1)

Saint Petersburg State University of Information Technologies, Mechanics and Optics¹

01 Jan 2018-Procedia Computer Science

TL;DR: This paper proposes an adaptive performance model that can dynamically scale an Apache Spark Streaming platform up and down on AWS, which existing models cannot do.

20 citations

Proceedings Article•DOI•

Efficient Event Correlation over Distributed Systems

Long Cheng¹, Boudewijn F. van Dongen¹, Wil M. P. van der Aalst¹•Institutions (1)

Eindhoven University of Technology¹

14 May 2017

TL;DR: This paper proposes a new algorithm, called RF-GraP, which provides a more efficient correlation over distributed systems and is able to achieve significant performance speedups with obviously less network communication.

Abstract: Event correlation is a cornerstone for process discovery over event logs crossing multiple data sources. The computed correlation rules and process instances will greatly help us to unleash the power of process mining. However, exploring all possible event correlations over a log could be time consuming, especially when the log is large. State-of-the-art methods based on MapReduce designed to handle this challenge have offered significant performance improvements over standalone implementations. However, all existing techniques are still based on a conventional generating-and-pruning scheme. Therefore, event partitioning across multiple machines is often inefficient. In this paper, following the principle of filtering-and-verification, we propose a new algorithm, called RF-GraP, which provides a more efficient correlation over distributed systems. We present the detailed implementation of our approach and conduct a quantitative evaluation using the Spark platform. Experimental results demonstrate that the proposed method is indeed efficient. Compared to the state-of-the-art, we are able to achieve significant performance speedups with obviously less network communication.

20 citations

Journal Article•DOI•

SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming

Wen Xiao¹, Juan Hu•Institutions (1)

Hohai University¹

04 Feb 2020-The Journal of Supercomputing

TL;DR: A distributed algorithm named SWEclat is proposed for mining frequent itemsets over massive streaming data. It is implemented on Apache Spark and stores the streaming data and the vertical-format dataset in Spark RDDs, which are divided into partitions for distributed processing.

Abstract: Finding frequent itemsets in a continuous streaming data is an important data mining task which is widely used in network monitoring, Internet of Things data analysis and so on. In the era of big data, it is necessary to develop a distributed frequent itemset mining algorithm to meet the needs of massive streaming data processing. Apache Spark is a unified analytic engine for massive data processing which has been successfully used in many data mining fields. In this paper, we propose a distributed algorithm for mining frequent itemsets over massive streaming data named SWEclat. The algorithm uses sliding window to process streaming data and uses vertical data structure to store the dataset in the sliding window. This algorithm is implemented by Apache Spark and uses Spark RDD to store streaming data and dataset in vertical data format, so as to divide these RDDs into partitions for distributed processing. Experimental results show that SWEclat algorithm has good acceleration, parallel scalability and load balancing.
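
The sliding-window and vertical-format ideas named in the abstract above can be illustrated with a minimal single-process sketch. This is not the authors' SWEclat implementation and it does not use Spark: the function names (`vertical_format`, `frequent_itemsets`, `mine_stream`), the window policy (keep the last `window_size` transactions), and the absolute `min_support` threshold are all assumptions made for illustration. It shows only the general Eclat-style principle that the support of an itemset is the size of the intersection of its items' transaction-id sets.

```python
from collections import deque
from itertools import combinations


def vertical_format(window):
    """Map each item to the set of transaction ids (tidset) that contain it."""
    tidsets = {}
    for tid, items in window:
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def frequent_itemsets(window, min_support):
    """Eclat-style level-wise search over tidsets (illustrative, not SWEclat itself)."""
    tidsets = vertical_format(window)
    # Level 1: single items whose tidset is large enough.
    level = {frozenset([i]): t for i, t in tidsets.items() if len(t) >= min_support}
    result = {}
    while level:
        result.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level = {}
        for (a, ta), (b, tb) in combinations(level.items(), 2):
            union = a | b
            # Join only itemsets that differ in exactly one item.
            if len(union) == len(a) + 1:
                tids = ta & tb  # transactions containing every item of the union
                if len(tids) >= min_support:
                    next_level[union] = tids
        level = next_level
    return result


def mine_stream(stream, window_size, min_support):
    """Yield the frequent itemsets of the current window after each arriving transaction."""
    window = deque(maxlen=window_size)  # the oldest transaction drops out automatically
    for tid, items in enumerate(stream):
        window.append((tid, frozenset(items)))
        yield frequent_itemsets(window, min_support)


if __name__ == "__main__":
    stream = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["b", "c"]]
    for step, found in enumerate(mine_stream(stream, window_size=3, min_support=2)):
        print(step, {tuple(sorted(k)): v for k, v in found.items()})
```

In a distributed setting, the tidsets would be the natural unit to partition across workers, which is the role the abstract assigns to Spark RDD partitions. The sketch recomputes the vertical layout for every window, whereas an incremental update would be the obvious refinement.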

20 citations

Monograph•DOI•

Galvani’s Spark

Alan J. McComas

08 Aug 2011

19 citations

Posted Content•

Cuttlefish: A Lightweight Primitive for Adaptive Query Processing.

Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, Johannes Gehrke

26 Feb 2018-arXiv: Databases

TL;DR: Cuttlefish is a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques.

Abstract: Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefish, a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques. We prototype Cuttlefish in Apache Spark and adaptively choose operators for image convolution, regular expression matching, and relational joins. Our experiments show Cuttlefish-based adaptive convolution and regular expression operators can reach 72-99% of the throughput of an all-knowing oracle that always selects the optimal algorithm, even when individual physical operators are up to 105x slower than the optimal. Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x compared with Spark SQL's query optimizer.
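
The explore-and-exploit idea described in the abstract above can be sketched with a simple epsilon-greedy bandit. This is an assumption-laden illustration, not Cuttlefish's actual tuner or API: the class `EpsilonGreedyTuner`, the operator names, and the fixed costs returned by `simulated_cost` are all hypothetical, and epsilon-greedy is swapped in here as the simplest multi-armed bandit strategy rather than the one used in the paper.

```python
import random


class EpsilonGreedyTuner:
    """Pick among candidate physical operators (arms) using epsilon-greedy selection."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.pulls = {a: 0 for a in self.arms}
        self.mean_reward = {a: 0.0 for a in self.arms}

    def choose(self):
        # Try every arm once before trusting any estimate.
        untried = [a for a in self.arms if self.pulls[a] == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.mean_reward[a])

    def observe(self, arm, reward):
        # Incremental update of the running mean reward for this arm.
        self.pulls[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm]


def simulated_cost(operator_name):
    """Hypothetical per-call cost of each operator; real costs would be measured."""
    return {"nested_loop": 9.0, "hash": 2.0, "sort_merge": 4.0}[operator_name]


def run(rounds=200):
    tuner = EpsilonGreedyTuner(["nested_loop", "hash", "sort_merge"])
    for _ in range(rounds):
        arm = tuner.choose()
        # Lower cost means higher reward, so the reward is the negated cost.
        tuner.observe(arm, reward=-simulated_cost(arm))
    return tuner


if __name__ == "__main__":
    tuner = run()
    print(tuner.pulls)
```

In a real system, the reward would come from measured throughput or latency of each operator invocation, and those measurements would be noisy, which is why continued exploration is needed at all.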

19 citations

Network Information

Performance Metrics

Papers: 7,304

Citations: 74,604

No. of papers in the topic in previous years
Year	Papers
2022	10
2021	429
2020	525
2019	661
2018	758
2017	683
