Spark (mathematics)

About: Spark (mathematics) is a research topic. Over its lifetime, 7,304 publications have been published within this topic, receiving 63,322 citations.


Papers
Book
25 May 2017
TL;DR: This practical book describes techniques that can reduce data infrastructure costs and developer hours and demonstrates performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.
Abstract: Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected, or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources. Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you'll also learn how to make it sing. With this book, you'll explore: how Spark SQL's new interfaces improve performance over SQL's RDD data structure; the choice between data joins in Core Spark and Spark SQL; techniques for getting the most out of standard RDD transformations; how to work around performance issues in Spark's key/value pair paradigm; writing high-performance Spark code without Scala or the JVM; how to test for functionality and performance when applying suggested improvements; using Spark MLlib and Spark ML machine learning libraries; and Spark's Streaming components and external community packages.

50 citations

DissertationDOI
11 Jun 2010
TL;DR: In this article, a two-dimensional model of spark discharge in air and spark ignition was developed using the non-reactive and reactive Navier-Stokes equations, and methods for calculating effective one-step parameters were developed using constant pressure explosion theory.
Abstract: Determining the risk of accidental ignition of flammable mixtures is a topic of tremendous importance in industry and aviation safety. The concept of minimum ignition energy (MIE) has traditionally formed the basis for studying ignition hazards of fuels. However, in recent years, particularly in the aviation safety industry, the viewpoint has changed to one where ignition is statistical in nature. Approaching ignition as statistical rather than a threshold phenomenon appears to be more consistent with the inherent variability in the engineering test data. Ignition tests were performed in lean hydrogen-based aviation test mixtures and in two hexane-air mixtures using low-energy capacitive spark ignition systems. Tests were carried out using both short, fixed sparks (1 to 2 mm) and variable length sparks up to 10 mm. The results were analyzed using statistical tools to obtain probability distributions for ignition versus spark energy and spark energy density (energy per unit spark length). Results show that a single threshold MIE value does not exist, and that the energy per unit length may be a more appropriate parameter for quantifying the risk of ignition than only the energy. The probability of ignition versus spark charge was also investigated, and the statistical results for the spark charge and spark energy density were compared. It was found that the test results were less variable with respect to the spark charge than the energy density. However, variability was still present due to phenomena such as plasma instabilities and cathode effects that are caused by the electrodynamics. Work was also done to develop a two-dimensional numerical model of spark ignition that accurately simulates all physical scales of the fluid mechanics and chemistry. In this work a two-dimensional model of spark discharge in air and spark ignition was developed using the non-reactive and reactive Navier-Stokes equations. 
One-step chemistry models were used to allow for highly resolved simulations, and methods for calculating effective one-step parameters were developed using constant pressure explosion theory. The one-step model was tuned to accurately simulate the flame speed, temperature, and straining behavior using one-dimensional flame computations. The simulations were performed with three different electrode geometries to investigate the effect of the geometry on the fluid mechanics of the evolving spark kernel and on flame formation. The computational results were compared with high-speed schlieren visualization of spark and ignition kernels. It was found that the electrode geometry had a significant effect on the fluid motion following spark discharge and hence influences the ignition process.

50 citations

Proceedings ArticleDOI
26 Jun 2016
TL;DR: This work proposes a unified engine for real-time operational analytics, delivering stream analytics, OLTP and OLAP in a single integrated solution through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics).
Abstract: In recent years, our customers have expressed frustration with the traditional approach of using a combination of disparate products to handle their streaming, transactional and analytical needs. The common practice of stitching heterogeneous environments in custom ways has caused enormous production woes by increasing development complexity and total cost of ownership. With SnappyData, an open source platform, we propose a unified engine for real-time operational analytics, delivering stream analytics, OLTP and OLAP in a single integrated solution. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire (as an in-memory transactional store with scale-out SQL semantics). In this demonstration, after presenting a few use case scenarios, we exhibit SnappyData as our in-memory solution for delivering truly interactive analytics (i.e., a couple of seconds), when faced with large data volumes or high velocity streams. We show that SnappyData can exploit state-of-the-art approximate query processing techniques and a variety of data synopses. Finally, we allow the audience to define various high-level accuracy contracts (HAC), to communicate their accuracy requirements with SnappyData in an intuitive fashion.

50 citations

Journal ArticleDOI
TL;DR: This paper proposes the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data, and proposes an efficient implementation of the discussed algorithm on Apache Spark.
Abstract: Despite more than two decades of progress, learning from imbalanced data is still considered one of the contemporary challenges in machine learning. This has been further complicated by the advent of the big data era, where popular algorithms dedicated to alleviating the class skew impact are no longer feasible due to the volume of datasets. Additionally, most existing algorithms focus on binary imbalanced problems, where majority and minority classes are well-defined. Multi-class imbalanced data poses further challenges, as the relationship between classes is much more complex and simple decomposition into a number of binary problems leads to a significant loss of information. In this paper, we propose the first compound framework for dealing with multi-class big data problems, addressing at the same time the existence of multiple classes and high volumes of data. We propose to analyze the instance-level difficulties in each class, leading to understanding what causes learning difficulties. We embed this information in popular resampling algorithms, which allows for informative balancing of multiple classes. We propose an efficient implementation of the discussed algorithm on Apache Spark, including a novel version of SMOTE that overcomes spatial limitations in distributed environments of its predecessor. An extensive experimental study shows that using instance-level information significantly improves learning from multi-class imbalanced big data. Our framework can be downloaded from https://github.com/fsleeman/minority-type-imbalanced.

50 citations


Network Information
Related Topics (5)
Software: 130.5K papers, 2M citations, 76% related
Combustion: 172.3K papers, 1.9M citations, 72% related
Cluster analysis: 146.5K papers, 2.9M citations, 72% related
Cloud computing: 156.4K papers, 1.9M citations, 71% related
Hydrogen: 132.2K papers, 2.5M citations, 69% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2022  10
2021  429
2020  525
2019  661
2018  758
2017  683