scispace - formally typeset
Topic

Spark (mathematics)

About: Spark (mathematics) is a research topic. Over the lifetime, 7304 publications have been published within this topic receiving 63322 citations.


Papers
Posted Content
TL;DR: DeepSpark is proposed, a distributed and parallel deep learning framework that simultaneously exploits Apache Spark for large-scale distributed data management and Caffe for GPU-based acceleration.
Abstract: The increasing complexity of deep neural networks (DNNs) has made it challenging to exploit existing large-scale data-processing pipelines for handling the massive data and parameters involved in DNN training. Distributed computing platforms and GPGPU-based acceleration provide a mainstream solution to this computational challenge. In this paper, we propose DeepSpark, a distributed and parallel deep learning framework that simultaneously exploits Apache Spark for large-scale distributed data management and Caffe for GPU-based acceleration. DeepSpark directly accepts Caffe input specifications, providing seamless compatibility with existing designs and network structures. To support parallel operations, DeepSpark automatically distributes workloads and parameters to Caffe-running nodes using Spark and iteratively aggregates training results by a novel lock-free asynchronous variant of the popular elastic averaging stochastic gradient descent (SGD) update scheme, effectively complementing the synchronized processing capabilities of Spark. DeepSpark is an ongoing project, and the current release is available at this http URL

37 citations
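The lock-free asynchronous elastic-averaging SGD update that DeepSpark's aggregation scheme is described as a variant of can be sketched as follows. The toy 1-D quadratic loss, the hyperparameters, and the serial "worker" loop are illustrative assumptions for exposition, not DeepSpark's actual API:

```python
# Sketch of the asynchronous elastic-averaging SGD (EASGD) update that
# DeepSpark's aggregation scheme is described as a variant of. The toy
# 1-D quadratic loss, the hyperparameters, and the serial "worker" loop
# are illustrative assumptions, not DeepSpark's actual API.

def grad(x, target=3.0):
    """Gradient of the toy loss f(x) = (x - target)**2 / 2."""
    return x - target

def easgd_step(x_worker, x_center, lr=0.1, alpha=0.05):
    """One asynchronous EASGD round for a single worker: a local SGD
    step plus a symmetric elastic pull between worker and center."""
    diff = x_worker - x_center
    x_worker = x_worker - lr * grad(x_worker) - alpha * diff
    x_center = x_center + alpha * diff
    return x_worker, x_center

# Three workers syncing with the shared center independently (lock-free
# in spirit: no barrier between workers).
workers = [0.0, 5.0, 10.0]
center = 0.0
for _ in range(200):
    for i in range(len(workers)):
        workers[i], center = easgd_step(workers[i], center)

print(round(center, 2))  # settles near the optimum x = 3
```

The elastic term pulls each worker toward the center variable and the center toward the workers, so no worker must wait for the others before updating, which is what makes the scheme attractive on a bulk-synchronous engine like Spark.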

Journal ArticleDOI
TL;DR: This paper proposes and evaluates cloud services for high resolution video streams in order to perform line detection using Canny edge detection followed by Hough transform in Hadoop and Spark and demonstrates the effectiveness of parallel implementation of computer vision algorithms to achieve good scalability for real-world applications.
Abstract: Nowadays, video cameras are increasingly used for surveillance, monitoring, and activity recording. These cameras generate high resolution image and video data at large scale. Processing such large scale video streams to extract useful information under time constraints is challenging. Traditional methods do not offer scalability to process large scale data. In this paper, we propose and evaluate cloud services for high resolution video streams in order to perform line detection using Canny edge detection followed by Hough transform. These algorithms are often used as preprocessing steps for various high level tasks including object, anomaly, and activity recognition. We implement and evaluate both Canny edge detector and Hough transform algorithms in Hadoop and Spark. Our experimental evaluation using Spark shows excellent scalability and performance compared to Hadoop and standalone implementations for both Canny edge detection and Hough transform. We obtained a speedup of 10.8× and 9.3× for Canny edge detection and Hough transform respectively using Spark. These results demonstrate the effectiveness of parallel implementations of computer vision algorithms in achieving good scalability for real-world applications.

37 citations
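The Hough-transform voting step that the paper parallelises can be sketched in a few lines; this is a minimal pure-Python stand-in on a tiny synthetic edge map, not the paper's Hadoop/Spark implementation:

```python
# Minimal pure-Python sketch of the Hough-transform voting step used for
# line detection (the stage the paper parallelises in Hadoop/Spark). The
# tiny synthetic edge map is an assumption for illustration only.
import math

def hough_accumulate(edge_points, width, height, n_theta=180):
    """Vote in (rho, theta) space for each edge pixel; rho is offset by
    max_rho so it can serve as a non-negative list index."""
    max_rho = int(math.hypot(width, height))
    acc = [[0] * n_theta for _ in range(2 * max_rho + 1)]
    for x, y in edge_points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[rho + max_rho][t] += 1
    return acc, max_rho

# Toy edge map: ten points on the vertical line x = 4.
points = [(4, y) for y in range(10)]
acc, max_rho = hough_accumulate(points, width=10, height=10)
print(acc[4 + max_rho][0])  # all 10 collinear points vote for (rho=4, theta=0)
```

Because each edge pixel votes independently, the accumulation is embarrassingly parallel: partitions of edge pixels can vote into local accumulators that are then summed, which is the shape of computation that maps naturally onto Spark.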

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the effective energy from spark discharge for direct blast initiation of spherical gaseous detonations using a piezoelectric pressure transducer.
Abstract: In this study, the effective energy from spark discharge for direct blast initiation of spherical gaseous detonations is investigated. In the experiment, direct initiation of detonation is achieved via a spark discharge from a high-voltage, low-inductance capacitor bank, and the spark energy is estimated from an analysis of the current output. To determine the blast wave energy from the powerful spark, the time-of-arrival of the blast wave in air is measured at different radii using a piezoelectric pressure transducer. Good agreement is found in the scaled blast trajectories, i.e., scaled time c0·t/R0 (where c0 is the ambient sound speed) as a function of scaled blast radius Rs/R0, between the numerical simulation of a spherical blast wave from a point energy source and the experimental results, where the explosion length scale R0 is computed using the equivalent spark energy from the first 1/4 current discharge cycle. Alternatively, by fitting the experimental trajectory data, the blast energy estimated from the numerical simulation is also in good agreement with that obtained experimentally using the 1/4-cycle criterion. Using the 1/4 cycle of spark discharge as the effective energy, direct initiation experiments of spherical gaseous detonations are carried out to determine the critical initiation energy in C2H2–2.5O2 mixtures with 70 and 0% argon dilution. The experimental results obtained with the 1/4 cycle of spark discharge agree well with the predictions of two initiation models, namely Lee's surface energy model and a simplified work-done model. The main source of discrepancy in the comparison can be explained by the uncertainty of the cell size measurement, which is needed for both semi-empirical models.

37 citations
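The scaling used in the abstract can be illustrated numerically: a spark energy defines an explosion length R0, which non-dimensionalises the measured trajectory as c0·t/R0 versus Rs/R0. The cube-root form of R0 shown here and every number below are illustrative assumptions, not the paper's data:

```python
# Sketch of the blast-wave scaling described above: a hypothetical
# 1/4-cycle spark energy defines the explosion length R0, which
# non-dimensionalises the measured trajectory as c0*t/R0 vs Rs/R0.
# The cube-root relation and every number below are illustrative
# assumptions, not the paper's data.

def explosion_length(spark_energy_j, ambient_pressure_pa=101_325.0):
    """Explosion length scale R0 = (E / p0)**(1/3) for a spherical
    point-energy blast (a dimensional form commonly used in this scaling)."""
    return (spark_energy_j / ambient_pressure_pa) ** (1.0 / 3.0)

def scaled_trajectory(times_s, radii_m, r0_m, c0_m_s=340.0):
    """Non-dimensionalise time-of-arrival data into (c0*t/R0, Rs/R0) pairs."""
    return [(c0_m_s * t / r0_m, r / r0_m) for t, r in zip(times_s, radii_m)]

# Hypothetical 5 J spark in air at 1 atm, three transducer radii.
r0 = explosion_length(5.0)
pairs = scaled_trajectory([1e-5, 2e-5, 3e-5], [0.02, 0.03, 0.04], r0)
```

Plotting such scaled pairs against a simulated point-source blast trajectory is what allows the comparison in the abstract to be made independently of the absolute spark energy.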

Journal ArticleDOI
TL;DR: This paper presents a completely redesigned distributed version of the popular ReliefF algorithm based on the novel Spark cluster computing model that is called DiReliefF and can process large volumes of data in a scalable way with much better processing times and memory usage.
Abstract: Feature selection (FS) is a key research area in the machine learning and data mining fields; removing irrelevant and redundant features usually helps to reduce the effort required to process a dataset while maintaining or even improving the processing algorithm's accuracy. However, traditional algorithms designed for execution on a single machine lack the scalability to deal with the increasing amount of data that has become available in the current Big Data era. ReliefF is one of the most important algorithms successfully implemented in many FS applications. In this paper, we present a completely redesigned distributed version of the popular ReliefF algorithm based on the novel Spark cluster computing model, which we have called DiReliefF. Spark is increasing in popularity due to its much faster processing times compared with Hadoop's MapReduce model implementation. The effectiveness of our proposal is tested on four publicly available datasets, all of them with a large number of instances and two of them also with a large number of features. Subsets of these datasets were also used to compare the results to a non-distributed implementation of the algorithm. The results show that the non-distributed implementation is unable to handle such large volumes of data without specialized hardware, while our design can process them in a scalable way with much better processing times and memory usage.

37 citations
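The weight update that DiReliefF distributes can be sketched on a single machine; this is a simplified Relief-style update (binary classes, one nearest hit and miss, rather than the full k-neighbour ReliefF), and the toy dataset is assumed for illustration:

```python
# Minimal single-machine sketch of the Relief-style weight update that
# DiReliefF distributes, simplified to binary classes with one nearest
# hit/miss (plain Relief rather than full ReliefF). Toy data is assumed.
import random

def diff(a, b, f, ranges):
    """Normalised per-feature difference between two instances."""
    return abs(a[f] - b[f]) / ranges[f]

def relief(X, y, n_samples=None, seed=0):
    rng = random.Random(seed)
    n_feat = len(X[0])
    ranges = [max(r[f] for r in X) - min(r[f] for r in X) or 1.0
              for f in range(n_feat)]
    w = [0.0] * n_feat
    idx = list(range(len(X)))
    samples = n_samples or len(X)
    for _ in range(samples):
        i = rng.choice(idx)

        def dist(j):
            return sum(diff(X[i], X[j], f, ranges) for f in range(n_feat))

        # Nearest hit (same class) and nearest miss (other class).
        hit = min((j for j in idx if j != i and y[j] == y[i]), key=dist)
        miss = min((j for j in idx if y[j] != y[i]), key=dist)
        for f in range(n_feat):
            w[f] += (diff(X[i], X[miss], f, ranges)
                     - diff(X[i], X[hit], f, ranges)) / samples
    return w

# Feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 0.5], [0.1, 0.9], [0.9, 0.4], [1.0, 0.8]]
y = [0, 0, 1, 1]
w = relief(X, y)
print(w[0] > w[1])  # the discriminative feature gets the larger weight
```

The update rewards features that differ between an instance and its nearest miss but agree with its nearest hit; the expensive part is the neighbour search over all instances, which is what a distributed redesign like DiReliefF has to partition.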

Journal ArticleDOI
TL;DR: The benefits of speed, resource consumption and scalability enable VariantSpark to open up the use of advanced, efficient machine learning algorithms on genomic data.
Abstract: Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the Map-Reduce paradigm. We therefore utilise the recently developed Spark engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VariantSpark, provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than ADAM, the Spark-based genome clustering approach; than the comparable implementation using Hadoop/Mahout; and than Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. The benefits of speed, resource consumption and scalability enable VariantSpark to open up the use of advanced, efficient machine learning algorithms on genomic data.

37 citations
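The population-structure task that VariantSpark performs at scale can be illustrated with a tiny stand-in: clustering individuals by their genotype vectors (0/1/2 alternate-allele counts). This pure-Python k-means is a toy substitute for MLlib's distributed implementation, and the data is made up:

```python
# Illustrative sketch of the population-structure task VariantSpark
# performs at scale: clustering individuals by their genotype vectors
# (0/1/2 alternate-allele counts). A tiny pure-Python k-means stand-in
# for MLlib's distributed implementation; the data below is made up.

def kmeans(points, k=2, iters=10):
    # Naive init for this k=2 toy demo: first and last point.
    centers = [points[0], points[-1]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return [min(range(k), key=lambda c: sum((a - b) ** 2
            for a, b in zip(p, centers[c]))) for p in points]

# Two hypothetical populations with distinct genotype profiles.
genotypes = [[0, 0, 1], [0, 1, 0], [2, 2, 1], [2, 1, 2]]
labels = kmeans(genotypes)
print(labels)  # the first two and last two individuals cluster together
```

At population scale the genotype matrix has millions of columns, so the assignment and centroid-averaging steps are what MLlib distributes across the cluster.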


Network Information
Related Topics (5)

Topic              Papers    Citations    Relatedness
Software           130.5K    2M           76%
Combustion         172.3K    1.9M         72%
Cluster analysis   146.5K    2.9M         72%
Cloud computing    156.4K    1.9M         71%
Hydrogen           132.2K    2.5M         69%
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2022    10
2021    429
2020    525
2019    661
2018    758
2017    683