Topic

Spark (mathematics)

About: Spark (mathematics) is a research topic. Over its lifetime, 7304 publications have been published within this topic, receiving 63322 citations.


Papers
Proceedings ArticleDOI
21 May 2018
TL;DR: The 24TB CompStor SSD is the first SSD capable of supporting in-storage computation running an operating system, enabling all types of applications and Linux shell commands to be executed in-place with no modification.
Abstract: The explosion of data-centric and data-dependent applications requires new storage devices, interfaces, and software stacks. Big data analytics solutions such as Hadoop, MapReduce, and Spark have addressed the performance challenge by using a distributed architecture based on a new paradigm that relies on moving computation closer to data. In this paper, we describe a novel approach aimed at pushing the "move computation to data" paradigm to its ultimate limit by enabling highly efficient and flexible in-storage processing capability in solid state drives (SSDs). We have designed CompStor, an FPGA-based SSD that implements computational storage through a software stack (devices, protocol, interface, software, and systems) and dedicated hardware for in-storage processing, including a quad-core ARM processor subsystem. The dedicated hardware resources provide in-storage data analytics capability without degrading the performance of common storage device functions such as read, write, and trim. Experimental results show up to 3X energy savings for some applications in comparison to the host CPU. To the best of our knowledge, the 24TB CompStor SSD is the first SSD capable of supporting in-storage computation running an operating system, enabling all types of applications and Linux shell commands to be executed in-place with no modification.

22 citations
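The energy argument for in-storage processing can be made concrete with a back-of-envelope model: when a filter is highly selective, processing in place avoids shipping most bytes across the storage link. A minimal sketch follows; every constant (link and compute energy per byte, selectivity) is an illustrative assumption, not a measurement from the paper.

```python
# Back-of-envelope model of the "move computation to data" trade-off.
# All figures are illustrative assumptions, not measurements from the paper.

SCAN_BYTES = 24e12          # full scan of a 24 TB drive
SELECTIVITY = 1e-4          # fraction of bytes a filter actually returns
PJ_PER_BYTE_LINK = 20.0     # assumed energy to move one byte host<->SSD (pJ)
PJ_PER_BYTE_COMPUTE = 5.0   # assumed energy to filter one byte (pJ)

def host_side_pj(total_bytes: float) -> float:
    """Host-side filtering: every byte crosses the storage link first."""
    return total_bytes * (PJ_PER_BYTE_LINK + PJ_PER_BYTE_COMPUTE)

def in_storage_pj(total_bytes: float, selectivity: float) -> float:
    """In-storage filtering: only matching bytes cross the link."""
    return (total_bytes * PJ_PER_BYTE_COMPUTE
            + total_bytes * selectivity * PJ_PER_BYTE_LINK)

host = host_side_pj(SCAN_BYTES)
near = in_storage_pj(SCAN_BYTES, SELECTIVITY)
print(f"host-side : {host / 1e12:.1f} J")
print(f"in-storage: {near / 1e12:.1f} J  ({host / near:.1f}x saving)")
```

Under these assumed constants the saving comes almost entirely from avoided data movement, which is the intuition behind the paper's reported energy results.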

Proceedings ArticleDOI
01 Oct 2016
TL;DR: In evaluation experiments, it is found that batch processing and stream processing have the same micro-architectural behavior in Spark if the two implementations differ only in micro-batching, and that Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.
Abstract: While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we have found that batch processing and stream processing have the same micro-architectural behavior in Spark if the two implementations differ only in micro-batching. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end bound stalls are reduced at larger input data rates and instruction retirement is improved. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.

22 citations
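To ground the RDD-versus-DataFrame distinction the paper measures, here is a minimal PySpark aggregation written in both styles. This is an illustrative job, not the authors' benchmark suite; the point is that DataFrame code compiles to an optimized query plan while RDD code runs opaque Python closures, which is consistent with the improved instruction retirement the authors observe.

```python
# Illustrative PySpark job expressed two ways: over an RDD and over a
# DataFrame. Not the paper's benchmark, just the two workload styles.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# RDD version: opaque closures that Spark cannot optimize.
rdd_result = (sc.parallelize(pairs)
                .reduceByKey(lambda x, y: x + y)
                .collect())

# DataFrame version: a declarative plan that Catalyst can optimize and
# Tungsten can compile into tighter generated code.
df_result = (spark.createDataFrame(pairs, ["key", "value"])
                  .groupBy("key")
                  .sum("value")
                  .collect())

print(rdd_result, df_result)
spark.stop()
```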

Posted Content
TL;DR: To support parallel operations, DeepSpark automatically distributes workloads and parameters to Caffe/Tensorflow-running nodes using Spark, and iteratively aggregates training results by a novel lock-free asynchronous variant of the popular elastic averaging stochastic gradient descent (EASGD) update scheme, effectively complementing the synchronized processing capabilities of Spark.
Abstract: The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data processing pipelines for handling the massive data and parameters involved in DNN training. Distributed computing platforms and GPGPU-based acceleration provide a mainstream solution to this computational challenge. In this paper, we propose DeepSpark, a distributed and parallel deep learning framework that exploits Apache Spark on commodity clusters. To support parallel operations, DeepSpark automatically distributes workloads and parameters to Caffe/Tensorflow-running nodes using Spark, and iteratively aggregates training results by a novel lock-free asynchronous variant of the popular elastic averaging stochastic gradient descent (EASGD) update scheme, effectively complementing the synchronized processing capabilities of Spark. DeepSpark is an ongoing project, and the current release is available at this http URL.

22 citations
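The aggregation scheme DeepSpark builds on is elastic averaging SGD (Zhang et al., 2015), where each worker is pulled toward a shared center variable and the center drifts toward the workers. Below is a minimal single-process NumPy sketch of that update rule on a toy quadratic loss; the lock-free asynchronous Spark machinery described in the paper is omitted.

```python
# Minimal single-process sketch of the elastic averaging SGD update that
# DeepSpark's aggregation builds on. Gradients come from a toy quadratic
# loss rather than a DNN, and all workers run in one loop for clarity.
import numpy as np

rng = np.random.default_rng(0)
dim, workers = 4, 3
eta, rho = 0.05, 0.1          # learning rate and elastic coefficient

center = np.zeros(dim)                         # shared "center" variable
locals_ = [rng.normal(size=dim) for _ in range(workers)]
target = np.ones(dim)                          # optimum of the toy loss

def grad(x):
    """Gradient of the toy loss 0.5 * ||x - target||^2."""
    return x - target

for step in range(200):
    for i in range(workers):
        elastic = rho * (locals_[i] - center)
        locals_[i] -= eta * (grad(locals_[i]) + elastic)  # worker update
        center += eta * elastic                           # center update

print("center after training:", np.round(center, 3))
```

The elastic term is what lets workers explore independently while keeping them loosely synchronized through the center, which is why it suits the asynchronous setting the paper targets.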

Proceedings ArticleDOI
08 Sep 2015
TL;DR: In an extensive empirical evaluation, it is shown that Bayesian Optimization can effectively find good parameter settings for four different stream processing topologies implemented in Apache Storm, resulting in significant gains over a parallel linear approach.
Abstract: Modern distributed computing frameworks such as Apache Hadoop, Spark, or Storm distribute the workload of applications across a large number of machines. Whilst they abstract away the details of distribution, they do require the programmer to set a number of configuration parameters before deployment. These parameter settings (usually) have a substantial impact on execution efficiency. Finding the right values for these parameters is considered a difficult task and requires domain, application, and framework expertise. In this paper, we propose a machine learning approach to the problem of configuring a distributed computing framework. Specifically, we propose using Bayesian Optimization to find good parameter settings. In an extensive empirical evaluation, we show that Bayesian Optimization can effectively find good parameter settings for four different stream processing topologies implemented in Apache Storm, resulting in significant gains over a parallel linear approach.

22 citations
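The optimization loop itself is simple to sketch: fit a surrogate model to (configuration, cost) observations and pick the next configuration to try via an acquisition function. The sketch below uses scikit-optimize's gp_minimize as a stand-in implementation with a synthetic objective; the parameter names are hypothetical Storm-style tunables, and in the paper's setting the objective would be an actual deployment that measures topology latency or throughput.

```python
# Sketch of a Bayesian Optimization loop for framework configuration,
# using scikit-optimize as a stand-in surrogate/acquisition implementation.
from skopt import gp_minimize
from skopt.space import Integer

# Hypothetical tunables, loosely modeled on Storm-style settings.
space = [
    Integer(1, 16, name="worker_processes"),
    Integer(1, 64, name="executor_threads"),
    Integer(128, 8192, name="max_spout_pending"),
]

def objective(params):
    workers, executors, pending = params
    # Synthetic cost with an interior optimum; a real run would deploy
    # the topology with these settings and return measured latency
    # (or negative throughput).
    return ((workers - 8) ** 2
            + (executors - 24) ** 2 / 10
            + (pending - 2048) ** 2 / 1e5)

result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("best settings:", result.x, "cost:", round(result.fun, 2))
```

Each call to the objective is expensive in practice (a full deployment and measurement run), which is exactly the regime where a sample-efficient method like Bayesian Optimization beats exhaustive or linear parameter sweeps.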


Network Information
Related Topics (5)
Topic              Papers    Citations   Relatedness
Software           130.5K    2M          76%
Combustion         172.3K    1.9M        72%
Cluster analysis   146.5K    2.9M        72%
Cloud computing    156.4K    1.9M        71%
Hydrogen           132.2K    2.5M        69%
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2022    10
2021    429
2020    525
2019    661
2018    758
2017    683