
Showing papers on "Spark (mathematics)" published in 2020


Book ChapterDOI
23 Aug 2020
TL;DR: In this article, spatial-temporal sparse incremental perturbations are generated online to make adversarial attacks on visual object tracking less perceptible, a setting rarely explored by previous work on classification attacks.
Abstract: Adversarial attacks of deep neural networks have been intensively studied on image, audio, and natural language classification tasks. Nevertheless, as a typical yet important real-world application, the adversarial attacks on online video tracking, which traces an object’s moving trajectory instead of its category, are rarely explored. In this paper, we identify a new task for adversarial attacks on visual tracking: online generation of imperceptible perturbations that mislead trackers along an incorrect (Untargeted Attack, UA) or a specified (Targeted Attack, TA) trajectory. To this end, we first propose a spatial-aware basic attack by adapting existing attack methods, i.e., FGSM, BIM, and C&W, and comprehensively analyze the attacking performance. We identify that online object tracking poses two new challenges: 1) it is difficult to generate imperceptible perturbations that can transfer across frames, and 2) real-time trackers require the attack to satisfy a certain level of efficiency. To address these challenges, we further propose the spatial-aware online incremental attack (a.k.a. SPARK) that performs spatial-temporal sparse incremental perturbations online and makes the adversarial attack less perceptible. In addition, as an optimization-based method, SPARK quickly converges to very small losses within several iterations by considering historical incremental perturbations, making it much more efficient than basic attacks. The in-depth evaluation of the state-of-the-art trackers (i.e., SiamRPN++ with AlexNet, MobileNetv2, and ResNet-50, and SiamDW) on OTB100, VOT2018, UAV123, and LaSOT demonstrates the effectiveness and transferability of SPARK in misleading the trackers under both UA and TA with minor perturbations.
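
For intuition, here is a minimal numpy sketch of an incremental, BIM-style signed-gradient update carried across frames, in the spirit of SPARK; tracker_loss_grad is a hypothetical placeholder for backpropagation through a real tracker such as SiamRPN++, and this is not the authors' implementation.

import numpy as np

def tracker_loss_grad(frame, perturbation, target_bbox):
    # Hypothetical placeholder: gradient of the tracker's loss w.r.t. the
    # perturbed frame, obtained in practice by backprop through the tracker.
    raise NotImplementedError

def incremental_attack(frames, target_bboxes, eps=0.3, bound=10.0, iters=10):
    # The perturbation is carried over from frame to frame (the "incremental"
    # part), so each new frame needs only a few refinement steps.
    pert = np.zeros_like(frames[0], dtype=np.float32)
    adv_frames = []
    for frame, bbox in zip(frames, target_bboxes):
        for _ in range(iters):
            g = tracker_loss_grad(frame, pert, bbox)
            pert = pert + eps * np.sign(g)       # BIM-style signed step
            pert = np.clip(pert, -bound, bound)  # L_inf budget keeps it subtle
        adv_frames.append(np.clip(frame + pert, 0, 255))
    return adv_frames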

56 citations


Journal ArticleDOI
TL;DR: This work investigates existing approaches to parameter tuning for both batch and stream data processing systems and classifies them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning.
Abstract: Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators grapple with understanding and tuning them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.
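
For context, a minimal PySpark sketch of the kind of configuration knobs these tuners search over; the specific values are illustrative, not recommendations from the survey.

from pyspark.sql import SparkSession

# A handful of the many parameters the surveyed tuners adjust.
spark = (SparkSession.builder
         .appName("tuning-demo")
         .config("spark.executor.memory", "4g")         # memory settings
         .config("spark.executor.cores", "2")           # parallelism
         .config("spark.sql.shuffle.partitions", "64")  # I/O and shuffle behavior
         .config("spark.io.compression.codec", "lz4")   # compression
         .getOrCreate())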

55 citations


Proceedings ArticleDOI
20 Apr 2020
TL;DR: This paper presents JUST, a spatio-temporal data engine whose operations are all performed through a SQL-like query language, JustQL; JUST manages big spatio-temporal data efficiently and conveniently, offers competitive query performance, and is much more scalable than other distributed data management systems.
Abstract: With the prevalence of positioning techniques, a prodigious amount of spatio-temporal data is generated constantly. To effectively support sophisticated urban applications based on spatio-temporal data, e.g., location-based services, an efficient, scalable, update-enabled, and easy-to-use spatio-temporal data management system is desirable. This paper presents JUST, i.e., the JD Urban Spatio-Temporal data engine, which can efficiently manage big spatio-temporal data in a convenient way. JUST incorporates the distributed NoSQL data store Apache HBase as the underlying storage, GeoMesa as the spatio-temporal data indexing tool, and Apache Spark as the execution engine. We creatively design two indexing techniques, i.e., Z2T and XZ2T, which accelerate spatio-temporal queries tremendously. Furthermore, we introduce a compression mechanism, which not only greatly reduces the storage cost, but also improves the query efficiency. To make JUST easy to use, we design and implement a complete SQL engine, with which all operations can be performed through a SQL-like query language, i.e., JustQL. JUST also inherently supports new data insertions and historical data updates without index reconstruction. JUST is deployed as a PaaS in JD with multi-user support. Many applications have been developed based on the SDKs provided by JUST. Extensive experiments are carried out against six state-of-the-art distributed spatio-temporal data management systems based on two real datasets and one synthetic dataset. The results show that JUST has competitive query performance and is much more scalable than the other systems.
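
Z2-style indexes are built on Z-order (Morton) space-filling curves; the following pure-Python sketch shows 2-D Morton encoding, the basic building block, while the exact Z2T/XZ2T layouts are specific to JUST and not reproduced here.

def morton2d(x, y, bits=16):
    # Interleave the bits of grid coordinates x and y into one Z-order key,
    # so points close in 2-D space tend to be close in key order.
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Cells are then scanned in Z-order; a time prefix (as in Z2T) would be
# prepended to the key so data is partitioned by time first.
print(morton2d(3, 5))  # -> 39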

50 citations


Journal ArticleDOI
TL;DR: The experimental results indicate that the proposed distributed computing framework on Spark can perform accurate multi-step forecasting of wind speed big data and computes faster than the stand-alone method when processing big data.

45 citations


Journal ArticleDOI
TL;DR: The plan is to improve computational engine efficiency and rain prediction models effectively using big data and Hadoop learning; the high timeliness and accuracy of the planned real-time hurricane-and-rain forecasting can solve this problem.

44 citations


Journal ArticleDOI
TL;DR: A detailed model-based evaluation shows that SmartSSD has the potential for a transformative impact when building a high-performance data analytics system, enabling a 3.04x performance improvement while consuming only 45.8 percent of the energy of the conventional CPU-based approach.
Abstract: Faced with the increasing disparity between SSD throughput and CPU-based compute capabilities, there has been growing interest in moving compute closer to storage to accelerate data analytics workloads. In this letter, we propose SmartSSD, an SSD with an onboard FPGA, which enables offloading computation into the SSD. We perform a detailed model-based evaluation of the end-to-end performance and energy benefit of SmartSSD for representative data analytics workloads with Spark SQL and the Parquet columnar data format. Our evaluation shows that SmartSSD has the potential for a transformative impact when building a high-performance data analytics system, enabling a 3.04x performance improvement while consuming only 45.8 percent of the energy of the conventional CPU-based approach.

41 citations


Journal ArticleDOI
TL;DR: A hybrid approach for the detection of SYN-DOS cyber-attacks on IoT devices is proposed: the application of an explicit Random Forest model, implemented directly on the IoT device, along with a second level analysis performed in the Cloud.
Abstract: In the fields of Internet of Things (IoT) infrastructures, attack and anomaly detection are rising concerns. With the increased use of IoT infrastructure in every domain, threats and attacks on these infrastructures are also growing proportionally. In this paper the performance of several machine learning algorithms in identifying cyber-attacks (namely SYN-DOS attacks) on IoT systems is compared, both in terms of application performance and in training/application times. We use supervised machine learning algorithms included in the MLlib library of Apache Spark, a fast and general engine for big data processing. We show the implementation details and the performance of those algorithms on public datasets using a training set of up to 2 million instances. We adopt a Cloud environment, emphasizing the importance of scalability and elasticity of use. Results show that all the Spark algorithms used achieve very good identification accuracy (>99%). Overall, one of them, Random Forest, achieves an accuracy of 1 (i.e., 100%). We also report a very short training time (23.22 sec for Decision Tree with 2 million rows). The experiments also show a very low application time (0.13 sec for over 600,000 instances with Random Forest) using Apache Spark in the Cloud. Furthermore, the explicit model generated by Random Forest is very easy to implement using high- or low-level programming languages. In light of the results obtained, both in terms of computation times and identification performance, a hybrid approach for the detection of SYN-DOS cyber-attacks on IoT devices is proposed: the application of an explicit Random Forest model, implemented directly on the IoT device, along with a second-level analysis (training) performed in the Cloud.
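
A minimal PySpark MLlib sketch of the first-level Random Forest classifier described above; the file name and column layout are hypothetical, and the paper's exact features and hyperparameters are not reproduced.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("syn-dos-rf").getOrCreate()

# Hypothetical schema: numeric traffic features plus a 0/1 SYN-DOS label.
df = spark.read.csv("iot_traffic.csv", header=True, inferSchema=True)
features = [c for c in df.columns if c != "label"]
df = VectorAssembler(inputCols=features, outputCol="features").transform(df)

train, test = df.randomSplit([0.8, 0.2], seed=42)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train)

acc = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(model.transform(test))
print(f"accuracy = {acc:.4f}")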

39 citations


Journal ArticleDOI
TL;DR: An optimized TF-IDF method under Spark is proposed to ensure that the text words can be restored, and the processing speed of LDA topic-model clustering is improved on Spark.
Abstract: Due to the slow processing speed of text topic clustering in a stand-alone architecture in the context of big data, this paper takes news text as the research object and proposes an LDA text topic clustering algorithm based on the Spark big data platform. Since the TF-IDF (term frequency-inverse document frequency) algorithm under Spark maps words irreversibly, the mapped word indexes cannot be traced back to the original words. In this paper, an optimized TF-IDF method under Spark is proposed to ensure that the text words can be restored. Firstly, text features are extracted by the proposed TF-IDF algorithm combined with CountVectorizer, and the features are then input to the LDA (Latent Dirichlet Allocation) topic model for training. Finally, the text topic clusters are obtained. Experimental results show that, for large data samples, the processing speed of LDA topic-model clustering is improved on Spark. At the same time, compared with the LDA topic model based on word-frequency input, the model proposed in this paper achieves lower perplexity.
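
The reversibility point can be seen in a small PySpark sketch: CountVectorizer keeps an explicit vocabulary, so LDA term indices map back to the original words (unlike hashing-based term frequencies); the toy documents are illustrative only.

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("lda-topics").getOrCreate()
# Hypothetical input: one row per news article, already tokenized.
docs = spark.createDataFrame(
    [(0, ["stock", "market", "rally"]), (1, ["team", "wins", "match"])],
    ["id", "tokens"])

# CountVectorizer keeps an explicit vocabulary, so term indices can be
# mapped back to words -- the reversibility plain hashing lacks.
cv_model = CountVectorizer(inputCol="tokens", outputCol="tf").fit(docs)
tf = cv_model.transform(docs)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

lda_model = LDA(k=2, maxIter=20, featuresCol="features").fit(tfidf)
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(3).collect():
    print([vocab[i] for i in row.termIndices])  # topics as actual words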

38 citations


Journal ArticleDOI
TL;DR: This paper proposes and evaluates cloud services for high resolution video streams in order to perform line detection using Canny edge detection followed by Hough transform in Hadoop and Spark and demonstrates the effectiveness of parallel implementation of computer vision algorithms to achieve good scalability for real-world applications.
Abstract: Nowadays, video cameras are increasingly used for surveillance, monitoring, and activity recording. These cameras generate high resolution image and video data at large scale. Processing such large scale video streams to extract useful information with time constraints is challenging. Traditional methods do not offer scalability to process large scale data. In this paper, we propose and evaluate cloud services for high resolution video streams in order to perform line detection using Canny edge detection followed by Hough transform. These algorithms are often used as preprocessing steps for various high level tasks including object, anomaly, and activity recognition. We implement and evaluate both Canny edge detector and Hough transform algorithms in Hadoop and Spark. Our experimental evaluation using Spark shows an excellent scalability and performance compared to Hadoop and standalone implementations for both Canny edge detection and Hough transform. We obtained a speedup of 10.8× and 9.3× for Canny edge detection and Hough transform respectively using Spark. These results demonstrate the effectiveness of parallel implementation of computer vision algorithms to achieve good scalability for real-world applications.
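
A minimal PySpark sketch of the per-frame pipeline, assuming OpenCV is installed on the workers; the frame source and Hough parameters are illustrative, not the paper's configuration.

import cv2
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("line-detect").getOrCreate()
sc = spark.sparkContext

def detect_lines(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # Canny edge detection
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=30, maxLineGap=10)  # Hough transform
    return [] if lines is None else lines.reshape(-1, 4).tolist()

# Hypothetical loader: frames as numpy BGR arrays, decoded elsewhere.
frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(8)]
results = sc.parallelize(frames).map(detect_lines).collect()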

37 citations


Journal ArticleDOI
TL;DR: By using four stages of successive refinements, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion.

35 citations


Journal ArticleDOI
Lin Chen, Jiaying Pan, Changwen Liu, Gequn Shu, Haiqiao Wei
01 Feb 2020-Energy
TL;DR: In this paper, a double-spark ignition system was used to investigate the influence of rapid combustion on engine performance and knocking characteristics, and the results showed that under the synchronous double-spark ignition condition, output power and effective thermal efficiency are improved because of the shortened combustion duration.

Journal ArticleDOI
TL;DR: This paper proposes a parallel adaptive Canopy-K-means algorithm, which can be used in a cloud computing framework to determine the distance threshold parameter T2 adaptively based on a statistical method.
Abstract: Firstly, this paper introduces the types of clustering algorithms, and describes the classical K-means and Canopy algorithms in detail. Then, combining the MapReduce computing model and the Spark cloud computing framework, it introduces the parallel Canopy-K-means algorithm, in which the Canopy algorithm optimizes the initial values of K-means. However, the Canopy algorithm introduces a new distance threshold parameter T2 that must be set from human experience, which is difficult to do manually for large data. This paper therefore proposes a parallel adaptive Canopy-K-means algorithm that determines the distance threshold parameter T2 adaptively, based on a statistical method, within the cloud computing framework. Using the parallelism of the MapReduce computing model, the parallel Canopy-K-means algorithm is optimized by adaptive parameter estimation, which removes the dependence on manual, experience-based parameter selection in the Canopy stage. After introducing the relevant theory and derivation of the algorithm, a cloud computing experiment platform is built on the Spark framework, and contrast experiments are performed using the Stanford Large Network Dataset Collection (SNAP) dataset and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.
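
A pure-Python sketch of the classic Canopy pass and one plausible statistical estimate of T2; the paper's actual adaptive estimator is not reproduced here.

import numpy as np

def canopy(points, t1, t2):
    # Classic Canopy pass with t1 > t2: points within t2 of a chosen center
    # leave the pool; points within t1 join the canopy loosely.
    assert t1 > t2
    pool = list(range(len(points)))
    canopies = []
    while pool:
        c = pool.pop(0)
        center = points[c]
        dists = np.linalg.norm(points[pool] - center, axis=1)
        members = [c] + [p for p, d in zip(pool, dists) if d < t1]
        pool = [p for p, d in zip(pool, dists) if d >= t2]
        canopies.append((center, members))
    return canopies  # canopy count and centers seed K and the K-means init

# Illustrative statistical choice of t2 from sampled pairwise distances,
# standing in for the paper's adaptive estimator.
pts = np.random.rand(100, 2)
t2 = np.percentile([np.linalg.norm(a - b) for a in pts[:20] for b in pts[:20]], 20)
print(len(canopy(pts, t1=2 * t2, t2=t2)))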

Journal ArticleDOI
TL;DR: The test results prove that ScienceEarth can efficiently store, retrieve, and process remote sensing data.
Abstract: Mass remote sensing data management and processing is currently one of the most important topics. In this study, we introduce ScienceEarth, a cluster-based data processing framework. The aim of ScienceEarth is to store, manage, and process large-scale remote sensing data in a cloud-based cluster-computing environment. The platform consists of three main parts: ScienceGeoData, ScienceGeoIndex, and ScienceGeoSpark. ScienceGeoData stores and manages remote sensing data. ScienceGeoIndex is an index and query system that combines a quad-tree spatial index with a Hilbert curve for heterogeneous tiled remote sensing data, which makes data retrieval in ScienceGeoData efficient. ScienceGeoSpark is an easy-to-use computing framework in which we use Apache Spark as the analytics engine for big remote sensing data processing. The test results prove that ScienceEarth can efficiently store, retrieve, and process remote sensing data, revealing its potential and capabilities for efficient big remote sensing data processing.

Journal ArticleDOI
TL;DR: Spark has better performance compared to Hadoop when data sets are small, achieving up to a two-times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
Abstract: Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for the industry. The advent of distributed computing frameworks such as Hadoop and Spark offers efficient solutions to analyze vast amounts of data. Due to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both of these frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default system parameters help system administrators deploy their applications without much effort, and they can measure their specific cluster performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance for large datasets? In this regard, this study investigates the most impactful parameters, covering resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark, using a cluster implemented in our laboratory. We used a trial-and-error approach to tune these parameters, based on a large number of experiments. To evaluate the frameworks in a comparative analysis, we selected two workloads: WordCount and TeraSort. The performance metrics are based on three criteria: execution time, throughput, and speedup. Our experimental results revealed that the performance of both systems heavily depends on input data size and correct parameter selection. The analysis of the results shows that Spark has better performance compared to Hadoop when data sets are small, achieving up to a two-times speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.
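
For reference, a minimal PySpark WordCount of the kind benchmarked here, with a couple of reconfigured parameters; the paths and values are illustrative, not the study's tuned settings.

import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("wordcount-bench")
         # Reconfigured from defaults, as in the study; values illustrative.
         .config("spark.executor.memory", "6g")
         .config("spark.default.parallelism", "32")
         .getOrCreate())
sc = spark.sparkContext

start = time.time()
counts = (sc.textFile("hdfs:///data/corpus.txt")   # hypothetical input path
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///out/wordcount")     # hypothetical output path
print(f"execution time: {time.time() - start:.1f}s")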

Journal ArticleDOI
TL;DR: In this article, the authors implement a novel remote sensing data flow (RESFlow) for advancing machine learning to compute with massive amounts of remotely sensed imagery, where the core contribution is partitioning a massive amount of data into homogeneous distributions for fitting simple models.
Abstract: The sheer volumes of data generated from earth observation and remote sensing technologies continue to make a major impact, leaping key geospatial applications into the dual data- and compute-intensive era. As a consequence, this rapid advancement poses new computational and data processing challenges. We implement a novel remote sensing data flow (RESFlow) for advancing machine learning to compute with massive amounts of remotely sensed imagery. The core contribution is partitioning massive amounts of data into homogeneous distributions for fitting simple models. RESFlow takes advantage of Apache Spark and the availability of modern computing hardware to harness the acceleration of deep learning inference on expansive remote sensing imagery. The framework incorporates a strategy to optimize resource utilization across multiple executors assigned to a single worker. We showcase its deployment in both computationally and data-intensive workloads for pixel-level labeling tasks. The pipeline invokes deep learning inference at three stages: during deep feature extraction, deep metric mapping, and deep semantic segmentation. The tasks impose compute-intensive and GPU resource-sharing challenges, motivating a parallelized pipeline for all execution steps. To address the problem of hardware resource contention, our containerized workflow further incorporates a novel GPU checkout routine and a ticketing system across multiple workers. The workflow is demonstrated on NVIDIA DGX accelerated platforms and offers appreciable compute speed-ups for deep learning inference on pixel labeling workloads; it processes 21,028 TB of imagery data and delivers output maps at an area rate of 5.245 sq.km/s, amounting to 453,168 sq.km/day, reducing a 28-day workload to 21 h.

Journal ArticleDOI
TL;DR: This work proposes a novel algorithm to forecast big data time series, based on the well-established Pattern Sequence-based Forecasting algorithm, which uses the Apache Spark distributed computation framework and is a ready-to-use application with few parameters to adjust.

Journal ArticleDOI
TL;DR: The proposed scheme, named efficient apriori-based frequent itemset mining (EAFIM), presents two novel methods to further improve efficiency; reducing the size of the input dataset for higher iterations enables EAFIM to perform better.
Abstract: Frequent itemset mining is considered a popular tool to discover knowledge from transactional datasets. It also serves as the basis for association rule mining. Several algorithms have been proposed to find frequent patterns, of which the apriori algorithm is considered the earliest. Apriori has two significant bottlenecks associated with it: first, repeated scanning of the input dataset, and second, the requirement to generate all candidate itemsets before counting their support values. These bottlenecks reduce the effectiveness of apriori for large-scale datasets. Considerable efforts have been made to diminish these bottlenecks so that efficiency can be improved. Especially when the data size is larger, even distributed and parallel environments like MapReduce do not perform well due to the iterative nature of the algorithm, which incurs high disk overhead. Apache Spark, on the other hand, is gaining significant attention in the field of big data processing because of its in-memory processing capabilities. Apart from utilizing the parallel and distributed computing environment of Spark, the proposed scheme, named efficient apriori-based frequent itemset mining (EAFIM), presents two novel methods to improve efficiency further. Unlike apriori, it generates the candidates 'on-the-fly', i.e., candidate generation and counting of support values proceed simultaneously while the input dataset is being scanned. Also, instead of using the original input dataset in each iteration, it computes an updated input dataset by removing useless items and transactions. The reduction in size of the input dataset for higher iterations enables EAFIM to perform better. Extensive experiments were conducted to analyze the efficiency and scalability of EAFIM, which outperforms other existing methodologies.
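
A simplified single-machine sketch of EAFIM's two ideas, on-the-fly candidate counting and shrinking the dataset between iterations; the authors' Spark implementation is not reproduced.

from itertools import combinations
from collections import Counter

def eafim_like(transactions, min_support):
    k, frequent = 1, {}
    data = [set(t) for t in transactions]
    current = {frozenset([i]) for t in data for i in t}
    while current:
        # On-the-fly: one scan both enumerates and counts candidates.
        counts = Counter(c for t in data for c in current if c <= t)
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        keep = set().union(*level) if level else set()
        # Reduction: drop useless items, then transactions too short for k+1.
        data = [t & keep for t in data]
        data = [t for t in data if len(t) > k]
        k += 1
        current = {a | b for a in level for b in level
                   if len(a | b) == k and all(
                       frozenset(s) in level for s in combinations(a | b, k - 1))}
    return frequent

print(eafim_like([["a", "b"], ["a", "c"], ["a", "b", "c"]], min_support=2))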

Posted Content
TL;DR: A single trainable NER model is presented that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT and can be extended to support other human languages with no code changes.
Abstract: Named entity recognition (NER) is a widely applicable natural language processing task and building block of question answering, topic modeling, information retrieval, etc. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.

Journal ArticleDOI
TL;DR: This research proposes to use Apache Spark to enhance the performance of a scalable stochastic optimization model for a microgrid (MG) serving multiple buildings, and to ensure that a significant portion of the wind power output will be utilized.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: This paper proposes pruning algorithms for a variety of queries and implements the system, Cheetah, on a Barefoot Tofino switch and Spark to partially offload query computation to the switch.
Abstract: Modern database systems are growing increasingly distributed and struggle to reduce query completion time with a large volume of data. In this paper, we leverage programmable switches in the network to partially offload query computation to the switch. While switches provide high performance, they have resource and programming constraints that make implementing diverse queries difficult. To fit in these constraints, we introduce the concept of data pruning -- filtering out entries that are guaranteed not to affect output. The database system then runs the same query but on the pruned data, which significantly reduces processing time. We propose pruning algorithms for a variety of queries. We implement our system, Cheetah, on a Barefoot Tofino switch and Spark. Our evaluation on multiple workloads shows 40 - 200% improvement in the query completion time compared to Spark.
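
The pruning idea can be illustrated for a TOP-N query in a few lines of Python, with each partition playing the switch's role; the actual Tofino data-plane program is far more constrained and is not reproduced here.

import heapq

def prune_topn(partitions, n):
    # Each partition forwards only entries that could still be in the global
    # top N (guaranteed losers are dropped); the database then runs the same
    # query over far less data.
    pruned = []
    for part in partitions:
        pruned.extend(heapq.nlargest(n, part))
    return heapq.nlargest(n, pruned)  # final query over the pruned data

parts = [[5, 1, 9, 3], [8, 2, 7], [4, 6, 10]]
print(prune_topn(parts, n=2))  # [10, 9]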

Journal ArticleDOI
TL;DR: A thorough review of various kinds of optimization techniques for the generality and performance of Spark is presented; the survey also introduces the Spark programming model and computing system and discusses their pros and cons.
Abstract: With the explosive increase of big data, it is necessary to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is state of the art in large-scale data computing systems nowadays, due to its good properties including generality, fault tolerance, high performance of in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It is originally positioned as a fast and general data processing system. A large body of research efforts have been made to make it more efficient (faster) and more general by considering various circumstances since its introduction. In this survey, we aim to provide a thorough review of various kinds of optimization techniques for the generality and performance improvement of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and investigate the solving techniques proposed in the literature. Moreover, we also introduce various data management and processing systems, machine learning algorithms, and applications supported by Spark. Finally, we discuss the open issues and challenges for Spark.
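
A minimal PySpark sketch of the RDD model the survey describes: lazy transformations build a lineage graph, and an action triggers execution.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-model").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1, 11))

# Transformations are lazy: they only extend the RDD lineage graph.
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions trigger execution of the whole lineage.
print(evens_squared.reduce(lambda a, b: a + b))  # 4+16+36+64+100 = 220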

Journal ArticleDOI
TL;DR: The results prove that the proposed method outperforms other methods, typically achieving 98–99% F-scores, and offering much greater accuracy than alternative techniques to detect both the period in which anomalies occurred and their type.
Abstract: Late detection and manual resolution of performance anomalies in Cloud Computing and Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose an artificial neural network based methodology for anomaly detection tailored to the Apache Spark in-memory processing platform. Apache Spark is widely adopted by industry because of its speed and generality; however, there is still a shortage of comprehensive performance anomaly detection methods applicable to this platform. We propose an artificial neural network driven methodology to quickly sift through Spark log data and operating system monitoring metrics to accurately detect and classify anomalous behaviors based on the characteristics of the Spark resilient distributed datasets. The proposed method is evaluated against three popular machine learning algorithms, decision trees, nearest neighbor, and support vector machine, as well as against four variants that consider different monitoring datasets. The results prove that our proposed method outperforms the other methods, typically achieving 98–99% F-scores, and offering much greater accuracy than alternative techniques in detecting both the period in which anomalies occurred and their type.
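
A minimal PySpark sketch using MLlib's multilayer perceptron as the neural classifier; the metric schema, layer sizes, and class set are hypothetical, not the paper's architecture.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier

spark = SparkSession.builder.appName("spark-anomaly").getOrCreate()

# Hypothetical schema: OS/Spark monitoring metrics plus an anomaly-class
# label (e.g., 0 = normal, 1 = CPU hog, 2 = memory leak, 3 = network).
df = spark.read.parquet("monitoring_metrics.parquet")
features = [c for c in df.columns if c != "label"]
df = VectorAssembler(inputCols=features, outputCol="features").transform(df)

mlp = MultilayerPerceptronClassifier(
    layers=[len(features), 32, 16, 4],  # input, two hidden layers, 4 classes
    labelCol="label", featuresCol="features", maxIter=200, seed=7)
model = mlp.fit(df)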

Journal ArticleDOI
TL;DR: This work introduces a novel method, called f-HMD, which aims at scalable hybrid model discovery in a cloud computing environment and returns hybrid process models to bridge the gap between formal and informal models.
Abstract: Process descriptions are used to create products and deliver services. To achieve better processes and services, the first step is to learn a process model. Process discovery is such a technique, one which can automatically extract process models from event logs. Although various discovery techniques have been proposed, they focus either on constructing formal models, which are very powerful but complex, or on creating informal models, which are intuitive but lack semantics. In this work, we introduce a novel method that returns hybrid process models to bridge this gap. Moreover, to cope with today's big event logs, we propose an efficient method, called f-HMD, which aims at scalable hybrid model discovery in a cloud computing environment. We present the detailed implementation of our approach on the Spark framework, and our experimental results demonstrate that the proposed method is efficient and scalable.

Journal ArticleDOI
TL;DR: Different DL models for IDS on Apache Spark have been implemented; an enhanced model is used to improve attack detection accuracy, and a computation delay comparison between Apache Spark and a regular implementation is presented.
Abstract: Internet evolution produced a connected world with a massive amount of data. This connectivity advantage came at the price of more complex and advanced attacks. An Intrusion Detection System (IDS) is an essential component for security in modern networks. The IDS methodology is either signature-based detection or anomaly behavior detection. Recently, researchers have adopted Deep Learning (DL) because it has better performance than traditional machine learning algorithms. The use of DL to produce a model for the IDS may take a long time because of computational complexity and the large number of hyperparameters. Different DL models for IDS on Apache Spark have been implemented in this article. This article uses the famous Network Security Lab - Knowledge Discovery and Data Mining (NSL-KDD) dataset and presents a computation delay comparison between Apache Spark and a regular implementation. Moreover, an enhanced model is used to improve attack detection accuracy.

Journal ArticleDOI
TL;DR: A new framework that can be used to pollute a clean, homogeneous and large data set from an arbitrary domain with duplicates, errors and inhomogeneities is described.
Abstract: Because of the increasing volume of autonomously collected data objects, duplicate detection is an important challenge in today's data management. To evaluate the efficiency of duplicate detection algorithms with respect to big data, large test data sets are required. Existing test data generation tools, however, are either unable to produce large test data sets or are domain-dependent, which limits their usefulness to a few cases. In this paper, we describe a new framework that can be used to pollute a clean, homogeneous, and large data set from an arbitrary domain with duplicates, errors, and inhomogeneities. As a proof of concept, we implemented a prototype built upon the cluster computing framework Apache Spark and evaluated its performance in several experiments.
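
A pure-Python sketch of the pollution idea, emitting near-duplicates with injected typos; the field choices and error model are illustrative, and a Spark version would distribute the same function with flatMap.

import random
import string

def pollute(row, dup_prob=0.2, err_prob=0.1):
    # Emit the clean row plus, with some probability, a near-duplicate
    # carrying injected character-level errors.
    out = [row]
    if random.random() < dup_prob:
        dup = list(row)
        for i, field in enumerate(dup):
            if field and random.random() < err_prob:
                pos = random.randrange(len(field))  # corrupt one character
                dup[i] = field[:pos] + random.choice(string.ascii_lowercase) + field[pos + 1:]
        out.append(tuple(dup))
    return out

# On Spark, the same function distributes over the clean data set:
#   sc.parallelize(clean_rows).flatMap(pollute)
print(pollute(("alice", "smith", "berlin")))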

Proceedings ArticleDOI
18 Jan 2020
TL;DR: MLlib, the Spark library for machine learning algorithms, is utilized for distributed computing, and the obtained results show that Spark produces high accuracy while parallelizing the process of load forecasting with highly competitive training and test times.
Abstract: Load forecasting in the smart grid is the process of predicting the amount of electrical power needed to meet short, medium, and long term demand. Accurate load forecasting helps electrical utilities manage their energy production, operations, control, and management. Most state-of-the-art forecasting methodologies utilize classical machine learning algorithms to predict the electrical load. There is a need for big data platforms and parallel distributed computing to be utilized to their full potential in the available solutions. In this paper, Apache Spark and Apache Hadoop are utilized as big data platforms for distributed computing in order to predict the load using available big data, and MLlib, the Spark library for machine learning algorithms, is utilized for distributed computing. Using MLlib allows testing classic regression algorithms such as linear regression, generalized linear regression, decision trees, random forests, and gradient-boosted trees, in addition to survival regression and isotonic regression. The obtained results show that Spark produces high accuracy while parallelizing the process of load forecasting with highly competitive training and test times. Actual big data are used in the load forecasting process.
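
A minimal PySpark sketch comparing two of the MLlib regressors mentioned above on a hypothetical feature table; the column names and split are illustrative.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("load-forecast").getOrCreate()

# Hypothetical schema: lagged loads, calendar and weather features -> "load".
df = spark.read.parquet("smart_meter_features.parquet")
cols = [c for c in df.columns if c != "load"]
df = VectorAssembler(inputCols=cols, outputCol="features").transform(df)
train, test = df.randomSplit([0.8, 0.2], seed=1)

evaluator = RegressionEvaluator(labelCol="load", metricName="rmse")
for reg in (LinearRegression(labelCol="load"),
            RandomForestRegressor(labelCol="load", numTrees=50)):
    model = reg.fit(train)
    print(type(reg).__name__, evaluator.evaluate(model.transform(test)))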

Journal ArticleDOI
TL;DR: A fuzzy fault tree analysis approach based on the similarity aggregation method (SAM-FFTA) is proposed that combines SAM with fuzzy set theory and can comprehensively handle diverse forms of opinions from different experts to obtain the probabilities of bottom events in a fault tree.
Abstract: Fault tree analysis (FTA) is an important method for analyzing the failure causes of engineering systems and evaluating their safety and reliability. In practical applications, the probabilities of bottom events in FTA are usually estimated according to the opinions of experts or engineers, because it is difficult to obtain sufficient probability data for the bottom events in a fault tree. However, in many cases there are many experts with different opinions, or different forms of opinions. How to reasonably aggregate expert opinions is a challenge for the engineering application of the fault tree method. In this study, a fuzzy fault tree analysis approach based on the similarity aggregation method (SAM-FFTA) is proposed. This method combines SAM with fuzzy set theory and can comprehensively handle diverse forms of opinions from different experts to obtain the probabilities of bottom events in the fault tree. Finally, to verify the applicability and flexibility of the proposed method, a natural gas spherical storage tank with a volume of 10,000 m³ was analyzed, and the importance of each bottom event was determined. The results show that flame, lightning spark, electrostatic spark, impact spark, mechanical breakdown, and deformation/breakage have the most significant influence on the explosion of the natural gas spherical storage tank.
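
A compact Python sketch of a Hsu-Chen-style similarity aggregation over triangular fuzzy numbers, the mechanism behind SAM; the paper's membership functions, expert weights, and defuzzification are not reproduced.

def sam_aggregate(opinions, weights=None, beta=0.5):
    # opinions: triangular fuzzy numbers (a, b, c) elicited from experts;
    # beta trades off expert weight against inter-expert agreement.
    n = len(opinions)
    weights = weights or [1.0 / n] * n

    def sim(p, q):  # similarity of two triangular fuzzy numbers
        return 1.0 - sum(abs(x - y) for x, y in zip(p, q)) / 3.0

    # Average agreement of each expert with all the others.
    aa = [sum(sim(opinions[i], opinions[j]) for j in range(n) if j != i) / (n - 1)
          for i in range(n)]
    ra = [a / sum(aa) for a in aa]                                 # relative agreement
    cc = [beta * w + (1 - beta) * r for w, r in zip(weights, ra)]  # consensus coeff.
    return tuple(sum(c * op[k] for c, op in zip(cc, opinions)) for k in range(3))

# Three experts' fuzzy estimates of a bottom event's probability grade:
print(sam_aggregate([(0.1, 0.2, 0.3), (0.2, 0.3, 0.4), (0.15, 0.25, 0.35)]))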

Journal ArticleDOI
TL;DR: The presented GA parallelization architecture outperforms the state-of-the-art reference architectures according to the computational experiments where the testing instances of traveling salesman problems are employed.

Journal ArticleDOI
TL;DR: This research presents a novel and scalable approach called "Smart Cassandra Spark Integration (SCSI)" for solving the challenge of integrating NoSQL data stores to manage distributed systems.
Abstract: For over a decade now we have been witnessing the success of massive parallel computation frameworks, such as MapReduce, Hadoop, Dryad, or Spark. Compared to the classic distributed algorithms or P...

Journal ArticleDOI
Wen Xiao, Juan Hu
TL;DR: A distributed algorithm for mining frequent itemsets over massive streaming data, named SWEclat, is proposed and implemented on Apache Spark; it uses Spark RDDs to store the streaming data and the dataset in vertical data format, dividing these RDDs into partitions for distributed processing.
Abstract: Finding frequent itemsets in continuous streaming data is an important data mining task which is widely used in network monitoring, Internet of Things data analysis, and so on. In the era of big data, it is necessary to develop distributed frequent itemset mining algorithms to meet the needs of massive streaming data processing. Apache Spark is a unified analytics engine for massive data processing which has been successfully applied in many data mining fields. In this paper, we propose a distributed algorithm for mining frequent itemsets over massive streaming data, named SWEclat. The algorithm uses a sliding window to process the streaming data and a vertical data structure to store the dataset within the sliding window. The algorithm is implemented on Apache Spark and uses Spark RDDs to store the streaming data and the dataset in vertical data format, dividing these RDDs into partitions for distributed processing. Experimental results show that the SWEclat algorithm achieves good acceleration, parallel scalability, and load balancing.
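
A pure-Python sketch of the vertical (Eclat) representation SWEclat distributes: each item maps to a tidset, and support comes from tidset intersections; the sliding-window and Spark partitioning logic are omitted.

from collections import defaultdict

def eclat(transactions, min_support, prefix=frozenset(), tidsets=None, out=None):
    # Vertical layout: item -> set of transaction ids (tidset). The support
    # of an itemset is the size of the intersection of its items' tidsets.
    out = {} if out is None else out
    if tidsets is None:  # build item -> tidset from the horizontal input
        tidsets = defaultdict(set)
        for tid, t in enumerate(transactions):
            for item in t:
                tidsets[item].add(tid)
    items = sorted(tidsets)
    for i, item in enumerate(items):
        tids = tidsets[item]
        if len(tids) >= min_support:
            itemset = prefix | {item}
            out[itemset] = len(tids)
            # Conditional vertical database for extensions of `itemset`.
            cond = {j: tidsets[j] & tids for j in items[i + 1:]}
            eclat(None, min_support, itemset, cond, out)
    return out

print(eclat([["a", "b"], ["a", "c"], ["a", "b", "c"]], min_support=2))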