Topic

Spark (mathematics)

About: Spark (mathematics) is a research topic. Over its lifetime, 7304 publications have been published within this topic, receiving 63322 citations.


Papers
Proceedings ArticleDOI
21 May 2018
TL;DR: The 24TB CompStor SSD is the first SSD capable of supporting in-storage computation running an operating system, enabling all types of applications and Linux shell commands to be executed in-place with no modification.
Abstract: The explosion of data-centric and data-dependent applications requires new storage devices, interfaces, and software stacks. Big data analytics solutions such as Hadoop, MapReduce, and Spark have addressed the performance challenge by using a distributed architecture based on a new paradigm that relies on moving computation closer to data. In this paper, we describe a novel approach aimed at pushing the "move computation to data" paradigm to its ultimate limit by enabling highly efficient and flexible in-storage processing capability in solid state drives (SSDs). We have designed CompStor, an FPGA-based SSD that implements computational storage through a software stack (devices, protocol, interface, software, and systems) and dedicated hardware for in-storage processing, including a quad-core ARM processor subsystem. The dedicated hardware resources provide in-storage data analytics capability without degrading the performance of common storage device functions such as read, write, and trim. Experimental results show up to 3X energy savings for some applications in comparison to the host CPU. To the best of our knowledge, the 24TB CompStor SSD is the first SSD capable of supporting in-storage computation running an operating system, enabling all types of applications and Linux shell commands to be executed in-place with no modification.

22 citations
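The energy argument for in-storage processing can be made concrete with a back-of-envelope model: when a filter is highly selective, processing in place avoids shipping most bytes across the storage link. A minimal sketch follows; every constant (link and compute energy per byte, selectivity) is an illustrative assumption, not a measurement from the paper.

```python
# Back-of-envelope model of the "move computation to data" trade-off.
# All figures are illustrative assumptions, not measurements from the paper.

SCAN_BYTES = 24e12          # full scan of a 24 TB drive
SELECTIVITY = 1e-4          # fraction of bytes a filter actually returns
PJ_PER_BYTE_LINK = 20.0     # assumed energy to move one byte host<->SSD (pJ)
PJ_PER_BYTE_COMPUTE = 5.0   # assumed energy to filter one byte (pJ)

def host_side_pj(total_bytes: float) -> float:
    """Host-side filtering: every byte crosses the storage link first."""
    return total_bytes * (PJ_PER_BYTE_LINK + PJ_PER_BYTE_COMPUTE)

def in_storage_pj(total_bytes: float, selectivity: float) -> float:
    """In-storage filtering: only matching bytes cross the link."""
    return (total_bytes * PJ_PER_BYTE_COMPUTE
            + total_bytes * selectivity * PJ_PER_BYTE_LINK)

host = host_side_pj(SCAN_BYTES)
near = in_storage_pj(SCAN_BYTES, SELECTIVITY)
print(f"host-side : {host / 1e12:.1f} J")
print(f"in-storage: {near / 1e12:.1f} J  ({host / near:.1f}x saving)")
```

Under these assumed constants the saving comes almost entirely from avoided data movement, which is the intuition behind the paper's reported energy results.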

Proceedings ArticleDOI
01 Oct 2016
TL;DR: In evaluation experiments, it is found that batch processing and stream processing have the same micro-architectural behavior in Spark if the two implementations differ only in micro-batching, and that Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.
Abstract: While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we have found that batch processing and stream processing have the same micro-architectural behavior in Spark if the two implementations differ only in micro-batching. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end bound stalls are reduced at larger input data rates and instruction retirement is improved. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.

22 citations
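To ground the RDD-versus-DataFrame distinction the paper measures, here is a minimal PySpark aggregation written in both styles. This is an illustrative job, not the authors' benchmark suite; the point is that DataFrame code compiles to an optimized query plan while RDD code runs opaque Python closures, which is consistent with the improved instruction retirement the authors observe.

```python
# Illustrative PySpark job expressed two ways: over an RDD and over a
# DataFrame. Not the paper's benchmark, just the two workload styles.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# RDD version: opaque closures that Spark cannot optimize.
rdd_result = (sc.parallelize(pairs)
                .reduceByKey(lambda x, y: x + y)
                .collect())

# DataFrame version: a declarative plan that Catalyst can optimize and
# Tungsten can compile into tighter generated code.
df_result = (spark.createDataFrame(pairs, ["key", "value"])
                  .groupBy("key")
                  .sum("value")
                  .collect())

print(rdd_result, df_result)
spark.stop()
```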

Posted Content
TL;DR: To support parallel operations, DeepSpark automatically distributes workloads and parameters to Caffe/Tensorflow-running nodes using Spark, and iteratively aggregates training results by a novel lock-free asynchronous variant of the popular elastic averaging stochastic gradient descent (EASGD) update scheme, effectively complementing the synchronized processing capabilities of Spark.
Abstract: The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data processing pipelines for handling the massive data and parameters involved in DNN training. Distributed computing platforms and GPGPU-based acceleration provide a mainstream solution to this computational challenge. In this paper, we propose DeepSpark, a distributed and parallel deep learning framework that exploits Apache Spark on commodity clusters. To support parallel operations, DeepSpark automatically distributes workloads and parameters to Caffe/Tensorflow-running nodes using Spark, and iteratively aggregates training results by a novel lock-free asynchronous variant of the popular elastic averaging stochastic gradient descent (EASGD) update scheme, effectively complementing the synchronized processing capabilities of Spark. DeepSpark is an ongoing project, and the current release is available at this http URL.

22 citations
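The aggregation scheme DeepSpark builds on is elastic averaging SGD (Zhang et al., 2015), where each worker is pulled toward a shared center variable and the center drifts toward the workers. Below is a minimal single-process NumPy sketch of that update rule on a toy quadratic loss; the lock-free asynchronous Spark machinery described in the paper is omitted.

```python
# Minimal single-process sketch of the elastic averaging SGD update that
# DeepSpark's aggregation builds on. Gradients come from a toy quadratic
# loss rather than a DNN, and all workers run in one loop for clarity.
import numpy as np

rng = np.random.default_rng(0)
dim, workers = 4, 3
eta, rho = 0.05, 0.1          # learning rate and elastic coefficient

center = np.zeros(dim)                         # shared "center" variable
locals_ = [rng.normal(size=dim) for _ in range(workers)]
target = np.ones(dim)                          # optimum of the toy loss

def grad(x):
    """Gradient of the toy loss 0.5 * ||x - target||^2."""
    return x - target

for step in range(200):
    for i in range(workers):
        elastic = rho * (locals_[i] - center)
        locals_[i] -= eta * (grad(locals_[i]) + elastic)  # worker update
        center += eta * elastic                           # center update

print("center after training:", np.round(center, 3))
```

The elastic term is what lets workers explore independently while keeping them loosely synchronized through the center, which is why it suits the asynchronous setting the paper targets.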

Proceedings ArticleDOI
08 Sep 2015
TL;DR: In an extensive empirical evaluation, it is shown that Bayesian Optimization can effectively find good parameter settings for four different stream processing topologies implemented in Apache Storm, resulting in significant gains over a parallel linear approach.
Abstract: Modern distributed computing frameworks such as Apache Hadoop, Spark, or Storm distribute the workload of applications across a large number of machines. Whilst they abstract away the details of distribution, they do require the programmer to set a number of configuration parameters before deployment. These parameter settings (usually) have a substantial impact on execution efficiency. Finding the right values for these parameters is considered a difficult task and requires domain, application, and framework expertise. In this paper, we propose a machine learning approach to the problem of configuring a distributed computing framework. Specifically, we propose using Bayesian Optimization to find good parameter settings. In an extensive empirical evaluation, we show that Bayesian Optimization can effectively find good parameter settings for four different stream processing topologies implemented in Apache Storm, resulting in significant gains over a parallel linear approach.

22 citations
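The optimization loop itself is simple to sketch: fit a surrogate model to (configuration, cost) observations and pick the next configuration to try via an acquisition function. The sketch below uses scikit-optimize's gp_minimize as a stand-in implementation with a synthetic objective; the parameter names are hypothetical Storm-style tunables, and in the paper's setting the objective would be an actual deployment that measures topology latency or throughput.

```python
# Sketch of a Bayesian Optimization loop for framework configuration,
# using scikit-optimize as a stand-in surrogate/acquisition implementation.
from skopt import gp_minimize
from skopt.space import Integer

# Hypothetical tunables, loosely modeled on Storm-style settings.
space = [
    Integer(1, 16, name="worker_processes"),
    Integer(1, 64, name="executor_threads"),
    Integer(128, 8192, name="max_spout_pending"),
]

def objective(params):
    workers, executors, pending = params
    # Synthetic cost with an interior optimum; a real run would deploy
    # the topology with these settings and return measured latency
    # (or negative throughput).
    return ((workers - 8) ** 2
            + (executors - 24) ** 2 / 10
            + (pending - 2048) ** 2 / 1e5)

result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("best settings:", result.x, "cost:", round(result.fun, 2))
```

Each call to the objective is expensive in practice (a full deployment and measurement run), which is exactly the regime where a sample-efficient method like Bayesian Optimization beats exhaustive or linear parameter sweeps.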


Network Information
Related Topics (5)
Topic              Papers    Citations   Relatedness
Software           130.5K    2M          76%
Combustion         172.3K    1.9M        72%
Cluster analysis   146.5K    2.9M        72%
Cloud computing    156.4K    1.9M        71%
Hydrogen           132.2K    2.5M        69%
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2022    10
2021    429
2020    525
2019    661
2018    758
2017    683